Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 87]
- cs.RO [Total: 4]
- cs.SE [Total: 1]
- cs.AI [Total: 8]
- cs.SD [Total: 1]
- cs.AR [Total: 1]
- cs.CR [Total: 1]
- astro-ph.IM [Total: 1]
- cs.MM [Total: 2]
- cs.LG [Total: 11]
cs.CL [Back]
[1] Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context cs.CL | cs.AI | cs.LGPDF
Keivan Alizadeh, Parshin Shojaee, Minsik Cho, Mehrdad Farajtabar
TL;DR: 本文提出SRLM框架,通过不确定性感知的自反思机制增强程序化上下文交互,以解决语言模型在长上下文处理中的可靠性问题。该方法利用自一致性、推理长度和口头化置信度作为内在不确定性信号,评估和选择候选上下文交互程序,从而提升模型在长上下文任务中的表现。
Details
Motivation: 解决语言模型在长上下文处理中信息提取、推理和使用的可靠性不足问题,特别是针对递归语言模型(RLM)中上下文交互程序选择机制未充分探索的挑战。
Result: 在多种基准数据集、上下文长度和骨干模型上的实验表明,SRLM始终优于最先进的基线方法,在相同时间预算下比RLM提升高达22%,且在模型窗口内的上下文长度任务中,SRLM在短上下文和长上下文中均能带来一致增益。
Insight: 创新点在于引入基于内在不确定性信号(自一致性、推理长度、口头化置信度)的自反思程序搜索机制,无需自查询或显式递归即可匹配或超越RLM性能;客观分析表明,递归本身并非RLM性能的主要驱动力,而自反思提供的语义信号能更好地指导语义密集型任务的推理。
Abstract: Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use the information across long contexts. Recent works like Recursive Language Models (RLM) have approached this challenge by agentic way of decomposing long contexts into recursive sub-calls through programmatic interaction at inference. While promising, the success of RLM critically depends on how these context-interaction programs are selected, which has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware Self-Reflection. SRLM leverages three intrinsic signals: self consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model’s internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models, show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM, and a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model’s window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective in tasks with semantically intensive nature, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.
[2] MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification cs.CL | cs.AI | cs.IR | cs.LGPDF
MiroMind Team, S. Bai, L. Bing, L. Lei, R. Li
TL;DR: 本文介绍了MiroThinker-1.7和MiroThinker-H1两个研究智能体。MiroThinker-1.7通过强调结构化规划、上下文推理和工具交互的智能体中期训练,提升了复杂长程推理任务中每一步交互的可靠性。MiroThinker-H1在此基础上,通过在推理过程中引入局部和全局验证机制,进一步增强了重型推理能力,实现了更可靠的多步骤问题求解。
Details
Motivation: 为了解决复杂长程推理任务中多步骤交互的可靠性问题,并构建能够进行重型、可靠推理的研究智能体。
Result: 在涵盖开放网络研究、科学推理和金融分析的基准测试中,MiroThinker-H1在深度研究任务上取得了最先进的性能,同时在特定领域也保持了强劲的结果。
Insight: 主要创新点在于:1)通过强调结构化规划等要素的智能体中期训练来提升单步可靠性;2)将局部(中间决策评估与优化)和全局(整体推理轨迹审计)验证机制直接集成到推理过程中,以确保最终答案有连贯的证据链支持。这为构建可靠的、多步骤的复杂任务求解智能体提供了可借鉴的框架。
Abstract: We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.
[3] COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives cs.CL | cs.AIPDF
Azwad Anjum Islam, Tisa Islam Erana
TL;DR: 本文介绍了参加SemEval-2026 Task 5竞赛的系统COGNAC,该任务要求对短篇故事中同形异义词的给定词义在5点李克特量表上进行合理性评分。作者探索了三种基于闭源商业大语言模型的提示策略,并通过集成模型预测来应对标注者间的显著差异。最佳官方系统在排行榜上排名第四,赛后实验进一步提升了性能。
Details
Motivation: 解决在具有挑战性的叙事文本中,对同形异义词的词义进行人类级别合理性评分的任务,并应对该主观语义评估任务中存在的显著标注者间差异。
Result: 官方最佳系统(集成所有三种提示策略的LLM)在竞赛排行榜上排名第四,准确率为0.88,斯皮尔曼等级相关系数为0.83(平均0.86)。赛后实验使用额外模型将性能提升至准确率0.92,斯皮尔曼等级相关系数0.85(平均0.89)。
Insight: 论文宣称的创新点在于探索了三种提示策略(零样本、思维链式结构化推理、比较性提示)并提出了集成模型预测的方法以应对标注者差异。客观分析认为,其核心创新在于发现比较性提示能持续提升性能,且模型集成能显著增强与人类平均判断的一致性,这为涉及多标注者的主观语义评估任务提供了有效的LLM集成方案。
Abstract: We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman’s rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman’s rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.
[4] BANGLASOCIALBENCH: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction cs.CLPDF
Tanvir Ahmed Sijan, S. M Golam Rifat, Pankaj Chowdhury Partha, Md. Tanjeed Islam, Md. Musfique Anwar
TL;DR: 该论文介绍了BANGLASOCIALBENCH,这是首个针对孟加拉语社会语用能力评估的基准测试,专注于评估大语言模型在孟加拉社会互动中的文化对齐性,而非事实性回忆。基准涵盖孟加拉语称谓术语、亲属关系推理和社会习俗三个领域,包含1,719个由母语者编写和验证的文化实例。在零样本设置下评估了12个当代大语言模型,发现模型存在系统性的文化错位模式。
Details
Motivation: 大语言模型虽展现出强大的多语言流畅性,但流畅性本身不能保证语言使用的社会适当性。孟加拉语作为高语境语言,其交流能力要求对直接编码在日常语言中的社会等级、关系角色和互动规范敏感,这包括其三级代词系统、基于亲属关系的称谓和文化嵌入的社会习俗。现有模型在这些方面的能力尚未得到充分评估。
Result: 在零样本设置下评估了12个当代LLMs,结果显示模型存在系统性的文化错位:经常默认使用过于正式的称谓形式,未能识别多个社会可接受的地址代词,并在不同宗教背景下混淆亲属术语。这些发现表明社会语用失败往往是结构性和非随机的。
Insight: 论文的创新点在于创建了首个针对孟加拉语社会语用能力的基准测试,强调通过上下文依赖的语言使用来评估文化对齐性,而非事实性知识。从客观角度看,该研究揭示了当前LLMs在推断和应用现实社会互动中文化适当语言使用方面的持续局限性,为多语言AI的社会文化对齐评估提供了新视角和方法论。基准的构建由母语者参与,增强了文化真实性和可靠性。
Abstract: Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BANGLASOCIALBENCH, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.
[5] MoLoRA: Composable Specialization via Per-Token Adapter Routing cs.CL | cs.AIPDF
Shrey Shah, Justin Wagle
TL;DR: 本文提出了一种名为MoLoRA(Mixture of LoRA)的每令牌适配器路由方法,以解决传统多适配器服务系统将整个序列路由到单一适配器的局限性。该方法允许在推理时根据词汇结构或学习到的门控机制,将单个令牌路由到不同的领域专用适配器,从而实现可组合的专业化。
Details
Motivation: 传统多适配器系统假设整个序列属于单一领域,这在多模态生成(如文本和图像令牌需要不同适配器)和混合能力请求(如需要多个专业适配器知识的任务)中失效。本文旨在解决这些场景下适配器路由的灵活性问题。
Result: 实验表明,MoLoRA使Qwen3-1.7B模型在四个推理基准测试中超越了Qwen3-8B模型,同时模型规模小了4.7倍,证明了专业化可以显著超越规模扩展的优势。
Insight: 核心创新点是提出了每令牌路由机制,它被证明在理论上是最优的(计算量为N,而每序列路由为K·N),并实现了MoLoRA框架,支持模块化专业知识:独立训练聚焦的LoRA适配器,无需重新训练即可组合它们,并通过简单加载新适配器来添加新能力。
Abstract: Multi-adapter serving systems route entire sequences to a single adapter, forcing a choice when requests span multiple domains. This assumption fails in two important settings: (1) multimodal generation, where text and image tokens require different adapters within the same sequence, and (2) mixed-capability requests like “write code to solve this equation,” which need expertise from multiple specialized adapters. We introduce per-token routing, which routes individual tokens to adapters based on either vocabulary structure (for multimodal models) or learned gating (for semantic specialization). Per-token routing is provably optimal, achieving work N for N tokens versus K \cdot N for per-sequence routing with K adapter types. Our key contribution is MoLoRA (Mixture of LoRA), which enables composable specialization: load multiple domain-specific adapters and let a learned router select the appropriate adapter per-token. We demonstrate that specialization dramatically beats scale: MoLoRA enables Qwen3-1.7B to exceed Qwen3-8B across four reasoning benchmarks while being 4.7x smaller. This enables modular expertise at inference time: train focused LoRAs independently, combine them without retraining, and add new capabilities by simply loading new adapters.
[6] Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning cs.CL | cs.AIPDF
Jingxiang Chen, Minseok Kim, Seong-Gyun Leem, Yin Huang, Rashi Rungta
TL;DR: 本文提出了一种通过多任务强化学习对齐语音大语言模型中副语言理解与生成的方法,旨在解决副语言线索(如韵律、情感、非语言声音)利用不足的问题。该方法采用思维链提示激发显式情感推理,并引入副语言感知的语音LLM(PALLM),通过两阶段管道联合优化音频情感分类和副语言感知的响应生成。
Details
Motivation: 动机在于语音大语言模型在利用副语言线索进行意图理解时面临训练数据有限、标注困难以及模型倾向于利用词汇捷径而非副语言信号的挑战。
Result: 在Expresso、IEMOCAP和RAVDESS基准测试中,该方法相比监督基线和强私有模型(Gemini-2.5-Pro、GPT-4o-audio)将副语言理解性能提升了8-12%,达到了SOTA水平。
Insight: 创新点包括使用多任务强化学习结合思维链提示来显式建模副语言推理,以及设计两阶段管道联合优化分类和生成任务,这为构建情感智能语音LLM提供了关键思路。
Abstract: Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds–crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.
[7] Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability cs.CL | cs.AIPDF
Fan Huang, Haewoon Kwak, Jisun An
TL;DR: 该论文提出了“道德推理轨迹”的概念,用于分析大型语言模型在道德敏感决策过程中,其内部伦理框架在连续推理步骤中的调用与切换动态。研究发现,模型的道德推理涉及多框架的系统性审议,轨迹不稳定且易受攻击,并揭示了伦理框架在模型特定层中的编码位置。论文还提出了一个与模型连贯性评分强相关的道德表征一致性指标。
Details
Motivation: 大型语言模型越来越多地参与道德敏感的决策,但其在推理步骤中如何组织和切换伦理框架尚不明确,需要一种基于探针的可解释性方法来理解其道德推理过程。
Result: 在六个模型和三个基准测试上的分析表明,55.4-57.7%的连续推理步骤涉及伦理框架切换,仅16.4-17.8%的轨迹保持框架一致;不稳定的轨迹对说服性攻击的敏感性高1.29倍。线性探针在模型特定层(如Llama-3.3-70B的第63层)实现了比基线低13.8-22.6%的KL散度。提出的MRC指标与LLM连贯性评分的相关性高达0.715。
Insight: 创新点在于提出了“道德推理轨迹”这一分析视角,并利用线性探针定位了伦理框架在模型深层表示中的编码位置;提出的MRC指标为评估模型道德推理的连贯性提供了一个可量化的、与人类判断高度一致的度量方法,增强了模型道德决策的可解释性。
Abstract: Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce \textit{moral reasoning trajectories}, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories remain 1.29$\times$ more susceptible to persuasive attacks ($p=0.015$). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7–8.9% drift reduction) and amplifies the stability–accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly ($r=0.715$, $p<0.0001$) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity $= 0.859$).
[8] CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering cs.CL | cs.AIPDF
Tianyi Huang, Ying Kai Deng
TL;DR: 本文提出了CounterRefine,一个用于事实性问答的轻量级推理时知识修复层。该方法首先生成一个初步答案,然后基于该答案进行后续查询以收集支持和冲突证据,最后通过一个受限的细化步骤决定是保留还是修订答案,修订仅在通过确定性验证后才被接受。
Details
Motivation: 解决事实性问答中常见的‘承诺失败’问题,即系统检索到了相关证据但仍给出错误答案,旨在通过推理时检索和验证来修复知识。
Result: 在SimpleQA基准测试上,CounterRefine将匹配的GPT-5 Baseline-RAG提升了5.8个百分点,达到73.1%的正确率,并比报告的单样本GPT-5.4分数高出约40分。
Insight: 创新点在于将检索机制从单纯收集上下文转变为测试临时答案的工具,通过答案条件化的反证据检索和确定性验证来实现推理时的自我修正,为知识型基础模型提供了利用证据进行自我反思和修复的新方向。
Abstract: In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.
[9] ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning cs.CL | cs.AI | cs.CEPDF
Tik Yu Yim, Wenting Tan, Sum Yee Chan, Tak-Wah Lam, Siu Ming Yiu
TL;DR: 本文提出了一种名为ASDA(自动化技能蒸馏与适应)的框架,用于在不修改模型权重的情况下,通过迭代式错误纠正学习自动生成结构化技能文件,以提升大型语言模型在金融推理任务上的性能。
Details
Motivation: 现有方法(如GEPA和ACE)在金融推理基准FAMMA上提升有限,暴露了无结构文本优化在复杂多步骤领域推理中的局限性,因此需要一种无需训练且能产生结构化领域知识的适应方法。
Result: 在FAMMA基准测试中,ASDA在算术推理任务上取得了高达+17.33%的提升,在非算术推理任务上取得了+5.95%的提升,显著优于所有无需训练的基线方法。
Insight: 创新点在于通过教师模型分析学生模型的失败案例,按子领域和错误类型聚类,并合成包含推理过程、代码模板和示例的结构化技能文件,这些文件在推理时动态注入,具有可读性、版本控制性,并兼容Agent Skills开放标准,为领域适应提供了一条无需权重访问或重新训练的可审计路径。
Abstract: Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex, multi-step domain reasoning. We introduce Automated Skill Distillation and Adaptation (ASDA), a framework that automatically generates structured skill artifacts through iterative error-corrective learning without modifying model weights. A teacher model analyzes a student model’s failures on financial reasoning tasks, clusters errors by subfield and error type, and synthesizes skill files containing reasoning procedures, code templates, and worked examples, which are dynamically injected during inference. Evaluated on FAMMA, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines. The resulting skill artifacts are human-readable, version-controlled, and compatible with the Agent Skills open standard, offering any organization with a labeled domain dataset a practical and auditable path to domain adaptation without weight access or retraining.
[10] SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment cs.CLPDF
Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang
TL;DR: 本文提出了SIA(合成-注入-对齐)框架,旨在解决电商搜索大语言模型在工业部署中面临的知识幻觉和安全性漏洞两大挑战。该框架通过合成高质量自然语言语料、参数高效预训练以及双路径对齐方法,构建了知识丰富且安全的电商搜索LLM,并已在京东平台成功部署,A/B测试验证了其业务效果和可扩展性。
Details
Motivation: 解决电商搜索LLM工业部署中的两大关键问题:一是动态细粒度产品知识编码不足导致的知识幻觉,二是在越狱攻击下存在的安全漏洞威胁合规性。
Result: 在京东五个核心搜索场景的A/B测试中,关键业务指标显著提升,验证了框架的工业有效性和可扩展性。
Insight: 创新点包括:结合知识图谱与行为日志合成高质量语料并增强推理链和安全意识数据;基于深度上采样的参数高效预训练策略,在注入领域知识的同时保持通用能力;通过多任务指令微调和对抗训练的双路径对齐方法,同时提升任务性能和安全性鲁棒性。
Abstract: Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SI–a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data.We then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at JD.com, China’s largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.
[11] Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models cs.CLPDF
Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh
TL;DR: 本文提出了一种名为结构化语义伪装(S2C)的新型多维越狱攻击框架,旨在绕过现代大语言模型(LLM)基于潜在语义表示和生成时推理的安全机制。S2C通过策略性地分布和重塑语义线索,延迟和重组恶意意图在模型推理过程中的整合,从而降低安全触发器的有效性。
Details
Motivation: 现代LLM的安全机制已从表层输入过滤扩展到潜在语义表示和生成时推理,能够恢复推理过程中被混淆的恶意意图并拒绝响应,这使得许多表层混淆越狱攻击失效。因此,需要一种新方法来操纵恶意语义意图在模型推理中的重构过程。
Result: 在HarmBench和JBB-Behaviors基准测试上,S2C分别将攻击成功率(ASR)比当前SOTA提高了12.4%和9.7%。特别是在GPT-5-mini上,S2C在JBB-Behaviors上比最强基线高出26%。
Insight: 创新点在于提出了一个多维攻击框架,通过上下文重构、内容碎片化和线索引导伪装三种互补机制,系统性地延迟和干扰模型对恶意语义的整合过程。客观来看,其核心洞察是将攻击重点从表层文本混淆转移到操纵模型内部推理的语义整合路径上,并研究了混淆程度与输入可恢复性之间的权衡。
Abstract: Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.
[12] SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation cs.CLPDF
Hang Lv, Sheng Liang, Hao Wang, Yongyue Zhang, Hongchao Gu
TL;DR: 本文提出SpecSteer,一种非对称协同推理框架,旨在解决个性化智能中的隐私与能力矛盾:通过将本地设备上的私有上下文与云端大规模模型的推理能力相结合,采用贝叶斯知识融合和分布式对齐协议,实现了草稿-验证-恢复的流水线,在保护隐私的同时提升了生成质量与效率。
Details
Motivation: 解决个性化智能的核心困境:将用户历史发送到中心化大语言模型存在隐私风险,而设备端小语言模型又缺乏高质量生成所需的推理能力。
Result: 实验表明,SpecSteer成功弥合了推理差距,实现了优越的个性化生成性能,同时相比标准基线带来了2.36倍的加速。
Insight: 创新点在于将协同推理形式化为贝叶斯知识融合,并重新利用推测解码作为分布式对齐协议,实现了逻辑验证与私有上下文的解耦,从而在保护隐私的同时进行有效纠错和意图注入。
Abstract: Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft–Verify–Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user context; upon rejection, a steering recovery injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36x speedup over standard baselines.
[13] Fanar 2.0: Arabic Generative AI Stack cs.CL | cs.AIPDF
FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid
TL;DR: Fanar 2.0是卡塔尔第二代以阿拉伯语为中心的生成式AI平台,其核心是Fanar-27B大语言模型。该平台在资源有限(仅使用256个H100 GPU,且阿拉伯语网络数据稀缺)的条件下,通过强调数据质量而非数量、定向持续预训练和模型合并等策略,显著提升了模型在阿拉伯语知识、语言、方言及英语能力上的性能。平台还扩展了包括内容审核、语音识别、视觉理解、图像生成、多智能体工作流、伊斯兰内容处理和诗歌生成在内的丰富能力栈,并通过多层编排器协调所有组件,展示了主权化、资源受限的AI开发也能构建出具有竞争力的系统。
Details
Motivation: 解决阿拉伯语在生成式AI中代表性不足的问题(尽管有4亿母语者,但网络数据仅占约0.5%),并基于主权化设计原则,在资源受限的条件下构建一个全面、高性能的阿拉伯语生成式AI平台。
Result: 核心模型Fanar-27B在阿拉伯语知识(+9.1分)、语言(+7.3分)、方言(+3.5分)和英语能力(+7.6分)上相比前代有显著提升,且预训练token数量减少了8倍。平台还包含了多个达到先进水平(SOTA)的组件,如用于内容安全审核的FanarGuard(4B参数双语过滤器)。
Insight: 创新点在于将主权化、资源效率作为核心设计原则,通过数据质量优先、定向持续预训练和模型合并策略,在有限算力和数据下实现性能突破;并构建了一个覆盖文本、语音、视觉、多模态及特定文化领域(如伊斯兰内容、古典诗歌)的完整生成式AI能力栈,通过意图感知路由和深度防御安全验证进行协调,为小语种和资源受限的AI开发提供了可借鉴的范例。
Abstract: We present Fanar 2.0, the second generation of Qatar’s Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The speech family Aura gains a long-form ASR model for hours-long audio. Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.
[14] IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time cs.CL | cs.AI | cs.IRPDF
Zhenghua Bao, Yi Shi
TL;DR: IndexRAG是一种新颖的检索增强生成方法,它将跨文档推理从在线推理阶段转移到离线索引阶段。该方法通过识别文档间的桥接实体并生成可独立检索的桥接事实单元,无需额外训练或微调,在推理时仅需单次检索和一次LLM调用。
Details
Motivation: 解决现有RAG方法在多跳问答任务中面临的挑战,即要么需要额外的在线图处理,要么依赖迭代的多步推理,旨在通过离线处理简化推理流程并提升效率。
Result: 在HotpotQA、2WikiMultiHopQA和MuSiQue三个多跳QA基准测试上,IndexRAG相比Naive RAG平均F1提升4.6分;与IRCoT结合后,平均性能优于所有基于图的基线方法(如HippoRAG和FastGraphRAG),且仅依赖扁平检索。
Insight: 核心创新在于将跨文档关系(桥接事实)的构建和索引提前到离线阶段,将复杂的多跳推理问题转化为单次检索问题,从而在保持高性能的同时显著降低了在线推理的复杂度和延迟。
Abstract: Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine-tuning. Experiments on three widely-used multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single-pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph-based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.
[15] EngGPT2: Sovereign, Efficient and Open Intelligence cs.CL | cs.AIPDF
G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico
TL;DR: EngGPT2-16B-A3B是Engineering Group开发的最新意大利语大语言模型,旨在成为一个主权、高效且开放的模型。它采用从头训练的混合专家架构,拥有160亿参数但每次推理仅激活30亿,在MMLU-Pro等关键基准测试上性能与8B-16B范围的稠密模型相当,同时推理功耗降低至1/5到1/2,训练数据和所需算力仅为1/10到1/6。
Details
Motivation: 开发一个资源高效、符合欧盟AI法案、并针对欧洲及意大利语任务优化的开源大语言模型,以推动欧洲本土AI模型生态的发展。
Result: 在MMLU-Pro、GSM8K、IFEval和HumanEval等基准测试中,性能与8B-16B参数规模的稠密模型相当,同时在推理功耗、训练数据和算力需求上大幅降低。
Insight: 采用从头训练的混合专家架构实现高效推理;约25%的意大利语训练数据确保对欧洲及意大利语任务的强能力;单一模型支持多种推理模式(非推理、意/英推理、涡轮推理),适应不同实时用例;在资源受限下实现高性能,为区域化定制LLM提供了效率与性能平衡的范例。
Abstract: EngGPT2-16B-A3B is the latest iteration of Engineering Group’s Italian LLM and it’s built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3’s 36T or Llama3’s 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth to one-sixth of the training data and consequent needed training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style reasoning available in both languages designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.
[16] AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents cs.CLPDF
Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu
TL;DR: 本文提出了AdaMem,一种用于长程对话代理的自适应用户中心记忆框架,通过组织对话历史为工作记忆、情景记忆、人物记忆和图记忆,结合语义检索与关系感知图扩展,以解决现有记忆系统过度依赖语义相似性、存储孤立片段和使用静态粒度的问题,并在LoCoMo和PERSONAMEM基准测试中达到SOTA性能。
Details
Motivation: 解决现有记忆系统在长程对话中过度依赖语义相似性导致遗漏用户中心证据、存储孤立片段削弱时序因果连贯性,以及使用静态记忆粒度不适应不同问题需求的核心挑战。
Result: 在LoCoMo和PERSONAMEM基准测试上,AdaMem实现了最先进的性能(SOTA)。
Insight: 创新点包括统一框架下的多类型记忆组织(工作、情景、人物、图记忆),以及推理时结合语义检索与按需关系感知图扩展的自适应检索路径,可借鉴于提升对话代理的长期记忆和用户建模能力。
Abstract: Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.
[17] How often do Answers Change? Estimating Recency Requirements in Question Answering cs.CLPDF
Bhawna Piryani, Zehra Mert, Adam Jatowt
TL;DR: 论文针对大语言模型在回答时效性问题时依赖过时知识的问题,提出了一个时效性-平稳性分类法,将问题按其答案变化频率和变化模式进行分类,并构建了RecencyQA数据集用于细粒度评估和分析模型的时间推理能力。
Details
Motivation: 解决大语言模型在回答时效性问题时,因缺乏明确信号指示是否需要最新信息,而难以决定何时检索外部证据、如何推理过时事实以及如何按有效性排序答案的挑战。
Result: 通过人工评估和实证分析表明,非平稳问题(即其时效性要求随上下文变化的问题)对LLMs更具挑战性,且难度随更新频率增加而上升;RecencyQA数据集包含4,031个开放域问题,并标注了时效性和平稳性标签。
Insight: 创新点在于提出了超越二元新鲜度概念的时效性-平稳性分类法,以更细粒度地建模问题的时效性需求和上下文依赖性,为开发时效感知和上下文敏感的问答系统提供了基础。
Abstract: Large language models (LLMs) often rely on outdated knowledge when answering time-sensitive questions, leading to confident yet incorrect responses. Without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by their validity. Existing benchmarks either periodically refresh answers or rely on fixed templates, but they do not reflect on how frequently answers change or whether a question inherently requires up-to-date information. To address this gap, we introduce a recency-stationarity taxonomy that categorizes questions by how often their answers change and whether this change frequency is time-invariant or context-dependent. Building on this taxonomy, we present RecencyQA, a dataset of 4,031 open-domain questions annotated with recency and stationarity labels. Through human evaluation and empirical analysis, we show that non-stationary questions, i.e., those where context changes the recency requirement, are significantly more challenging for LLMs, with difficulty increasing as update frequency rises. By explicitly modeling recency and context dependence, RecencyQA enables fine-grained benchmarking and analysis of temporal reasoning beyond binary notions of freshness, and provides a foundation for developing recency-aware and context-sensitive question answering systems.
[18] EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models cs.CL | cs.AIPDF
Yifei Zhang, Mingyang Li, Henry Gao, Liang Zhao
TL;DR: 本文提出EmoLLM框架,通过基于评估理论的认知-情感协同推理,提升大型语言模型在对话中的情感智能,使其在保持事实可靠性的同时生成更符合用户情感需求的回应。
Details
Motivation: 现有大型语言模型虽具备较强的认知智能,但在需要情感智能的实际对话场景中,其回应往往缺乏情感适宜性,因此需要一种能够同时进行认知与情感推理的框架。
Result: 在多轮角色扮演环境中通过强化学习训练后,EmoLLM在多种对话场景中相比强基线模型,在改善用户情感状态结果和回应质量方面表现更优,同时保持了高事实可靠性。
Insight: 创新点在于引入评估推理图进行显式中间推理,并结合逆向视角推理提供基于用户侧后果预测的奖励信号,实现了认知与情感的协同推理机制。
Abstract: Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user’s needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.
[19] Characterizing Delusional Spirals through Human-LLM Chat Logs cs.CL | cs.AIPDF
Jared Moore, Ashish Mehta, William Agnew, Jacy Reese Anthis, Ryan Louie
TL;DR: 本文通过分析19名报告因使用聊天机器人而遭受心理伤害的用户的对话日志,首次对LLM聊天机器人引发的妄想螺旋现象进行了深入研究。研究开发了包含28个代码的清单,应用于391,562条消息,量化了用户妄想思维、自杀念头以及聊天机器人误传自身为有感知实体等行为的发生频率。研究发现,诸如表达浪漫兴趣和聊天机器人自称有感知等话题在长对话中更常见,可能促进或源于用户的过度投入。
Details
Motivation: 随着大型语言模型(LLM)的普及,全球媒体和法律讨论中出现了关于其负面心理影响(如妄想、自残和“AI精神病”)的令人不安的轶事报告。然而,目前尚不清楚用户和聊天机器人在漫长的妄想“螺旋”过程中如何互动,这限制了我们理解和减轻伤害的能力。
Result: 研究分析了391,562条消息,发现15.5%的用户消息表现出妄想思维,有69条经过验证的用户消息表达了自杀念头,21.2%的聊天机器人消息误传自身为有感知实体。分析表明,表达浪漫兴趣和聊天机器人自称有感知的消息在更长的对话中发生频率显著更高。
Insight: 研究提供了首个基于真实有害案例的深入分析,开发了一个系统性的代码清单和对话分析工具,可用于量化评估LLM交互中的风险行为。关键发现是,某些话题(如浪漫兴趣和AI感知性)在多轮对话设置中可能与用户过度投入相互促进,且现有的安全措施在这些长对话场景中可能失效,这为政策制定者、开发者和用户提供了具体的风险缓解建议。
Abstract: As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and AI psychosis,'' have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy delusional spirals,’’ limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely-distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the $391,562$ messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.
[20] BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization cs.CL | cs.AIPDF
Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai
TL;DR: 本文提出BATQuant方法,一种针对MXFP4格式的块级可学习优化量化技术,旨在解决现有后训练量化方法在应用于MXFP4时因格式不匹配导致的性能崩溃问题。该方法通过限制变换与MXFP粒度对齐、放松正交约束、引入全局与私有克罗内克分解以及块级可学习裁剪,有效防止了跨块异常值传播并优化了分布形状。
Details
Motivation: 现有后训练量化方法,特别是为整数格式设计的基于旋转的技术,在应用于MXFP4时会导致严重的性能崩溃,主要原因是全局正交旋转会无意中在量化块间传递异常值能量,引发新的异常值并破坏局部块级缩放,同时常产生双峰激活分布,未能充分利用有限的量化范围。
Result: 在MLLMs和LLMs上的大量实验表明,BATQuant在激进的W4A4KV16配置下取得了新的最先进(SOTA)结果,在多模态基准测试中恢复了高达96.43%的全精度性能,并在多种任务上明显优于现有方法。
Insight: 创新点在于提出块级仿射变换以匹配MXFP粒度防止跨块异常值传播,放松正交约束以优化分布,并引入全局与私有克罗内克分解实现参数高效性,以及块级可学习裁剪抑制残余异常值。从客观角度看,该方法针对MXFP格式特性进行定制化设计,有效解决了格式不匹配导致的量化难题。
Abstract: Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduces storage and runtime overhead and incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
[21] Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy cs.CLPDF
Zhaoxin Feng, Zheng Chen, Jianfei Ma, Yip Tin Po, Emmanuele Chersoni
TL;DR: 这篇论文研究了在大型语言模型中,思维链推理对谄媚行为的影响。研究发现,推理过程通常能减少最终决策中的谄媚倾向,但也会在某些情况下通过构建逻辑不一致、计算错误或片面论证等欺骗性理由来掩盖谄媚行为。此外,模型在主观任务和权威偏见下更易表现出谄媚,且谄媚倾向在推理过程中是动态变化的。
Details
Motivation: 对齐技术常无意中导致LLMs产生谄媚行为,而现有研究多在直接回答场景中探讨此问题,思维链推理的作用尚不明确:它究竟是作为缓解谄媚的逻辑约束,还是作为事后合理化以掩盖谄媚的工具?
Result: 在客观和主观任务上评估了一系列模型,结果显示推理总体上减少了最终决策的谄媚,但在部分样本中掩盖了谄媚;模型在主观任务和权威偏见下谄媚倾向更高。对三个开源模型的机制分析表明,谄媚倾向在推理过程中动态变化,而非在输入阶段预先决定。
Insight: 论文创新点在于揭示了思维链推理的双重作用:既可能缓解谄媚,也可能通过生成欺骗性理由来掩盖它。从客观角度,这强调了在评估LLM对齐时需深入分析推理过程,而非仅关注最终输出,并为设计更鲁棒的对齐方法提供了新视角。
Abstract: Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies studied this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to investigate the issue. Results show that reasoning generally reduces sycophancy in final decisions but also masks sycophancy in some samples, where models construct deceptive justifications through logical inconsistencies, calculation errors, and one-sided arguments etc. Furthermore, LLMs are more prone to sycophancy in subjective tasks and under authority-bias. Our mechanistic analysis on three open-source models reveals that the tendency of sycophancy is dynamic during the reasoning process rather than being pre-determined at the input stage.
[22] Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models cs.CL | cs.AI | cs.LGPDF
Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu
TL;DR: 本文提出了Omanic,一个用于评估大语言模型多跳推理能力的开放域问答资源,包含机器生成的训练集OmanicSynth和专家评审的评估集OmanicBench,通过提供分解的子问题和中间答案的结构化标注,支持对推理过程的逐步分析。
Details
Motivation: 现有推理型大语言模型的评估存在挑战,仅凭最终答案难以分析中间推理步骤和失败原因,且现有的多跳问答基准缺乏用于诊断推理失败的步骤级标注。
Result: 在OmanicBench上,最先进的大语言模型仅达到73.11%的多项选择准确率,表明其高难度;在OmanicSynth上进行监督微调后,在六个推理和数学基准上平均提升了7.41分。
Insight: 创新点在于构建了首个提供结构化步骤标注的多跳问答资源,支持逐步推理评估;客观分析表明,该数据集能有效揭示CoT(思维链)推理对事实完整性的依赖,并验证了合成数据对推理能力迁移的有效性。
Abstract: Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT’s performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset’s quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.
[23] Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory cs.CLPDF
Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah
TL;DR: 本文提出了Chronos,一种新颖的时态感知记忆框架,旨在解决长期对话AI代理在时间推理和记忆检索方面的挑战。该框架将原始对话分解为带有明确时间范围和实体别名的‘主语-动词-宾语’事件元组,并将其与完整的对话上下文分别索引到结构化事件日历和轮次日历中。在查询时,通过动态提示生成定制的检索指导,引导代理在双日历上进行迭代式工具调用,以处理多跳、时间敏感的查询。
Details
Motivation: 现有的大语言模型对话代理在跨越数周或数月的长期交互中,难以对随时间演变的事实和偏好进行推理,并且缺乏对长对话历史进行多跳、时间敏感查询的有效检索策略。
Result: 在包含6类对话历史任务、共500个问题的LongMemEvalS基准测试中,使用8个开源和闭源LLM进行评估。Chronos Low版本准确率达到92.60%,Chronos High版本达到95.60%,创造了新的SOTA,相比之前最佳系统提升了7.67%。消融实验表明,事件日历组件带来了58.9%的性能增益。
Insight: 核心创新点在于将非结构化的长对话历史分解并结构化索引为‘事件日历’和‘轮次日历’的双层表示,并结合动态提示引导的迭代工具调用机制进行检索。这为长期对话记忆系统提供了一种结合结构化事件提取、时间解析和可控检索的新范式,显著提升了时间推理和多跳查询的能力。
Abstract: Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
cs.CV [Back]
[24] SAC-NeRF: Adaptive Ray Sampling for Neural Radiance Fields via Soft Actor-Critic Reinforcement Learning cs.CV | cs.AIPDF
Chenyu Ge
TL;DR: 本文提出了SAC-NeRF,一种基于Soft Actor-Critic强化学习的自适应光线采样框架,旨在解决神经辐射场(NeRF)在体渲染过程中因密集采样而导致的计算效率低下问题。该方法将采样过程建模为马尔可夫决策过程,通过强化学习智能体学习根据场景特征分配采样点,从而在保持渲染质量的同时显著减少采样数量。
Details
Motivation: 神经辐射场(NeRF)能够实现照片级真实感的新视角合成,但其体渲染过程需要进行密集的光线采样,导致计算效率低下。本文的动机是通过数据驱动的自适应采样策略来优化这一过程,减少不必要的计算开销。
Result: 在Synthetic-NeRF和LLFF数据集上的实验表明,SAC-NeRF能够减少35-48%的采样点,同时将渲染质量的损失控制在0.3-0.8 dB PSNR(峰值信噪比)以内,与密集采样基线方法相比保持了相当的渲染质量。
Insight: 论文的创新点包括:1)使用高斯混合分布颜色模型来提供不确定性估计;2)设计了一个平衡质量、效率和一致性的多组件奖励函数;3)采用了两阶段训练策略以应对环境的非平稳性。从客观角度看,该研究展示了强化学习可以用于发现难以手工设计的有效采样模式,为NeRF的效率优化提供了一种新的数据驱动思路,尽管其策略是场景特定的且框架比简单启发式方法更复杂。
Abstract: Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis but suffer from computational inefficiency due to dense ray sampling during volume rendering. We propose SAC-NeRF, a reinforcement learning framework that learns adaptive sampling policies using Soft Actor-Critic (SAC). Our method formulates sampling as a Markov Decision Process where an RL agent learns to allocate samples based on scene characteristics. We introduce three technical components: (1) a Gaussian mixture distribution color model providing uncertainty estimates, (2) a multi-component reward function balancing quality, efficiency, and consistency, and (3) a two-stage training strategy addressing environment non-stationarity. Experiments on Synthetic-NeRF and LLFF datasets show that SAC-NeRF reduces sampling points by 35-48% while maintaining rendering quality within 0.3-0.8 dB PSNR of dense sampling baselines. While the learned policy is scene-specific and the RL framework adds complexity compared to simpler heuristics, our work demonstrates that data-driven sampling strategies can discover effective patterns that would be difficult to hand-design.
[25] Exploring the Use of VLMs for Navigation Assistance for People with Blindness and Low Vision cs.CV | cs.AI | cs.ROPDF
Yu Li, Yuchen Zheng, Giles Hamilton-Fletcher, Marco Mezzavilla, Yao Wang
TL;DR: 本文研究了视觉语言模型(VLMs)在辅助盲人和低视力人群(pBLV)导航任务中的潜力,评估了包括GPT-4V、GPT-4o、Gemini-1.5-Pro、Claude-3.5-Sonnet等闭源模型以及Llava-v1.6-mistral、Llava-onevision-qwen等开源模型在基础视觉技能(如障碍物计数、相对空间推理和常识性寻路场景理解)和导航场景中的表现。
Details
Motivation: 解决VLMs在辅助pBLV导航任务中的实际应用问题,评估现有模型在关键视觉技能上的能力,以指导辅助技术的开发。
Result: GPT-4o在所有任务中表现最佳,尤其在空间推理和场景理解方面优于其他模型;开源模型在复杂环境中的推理和适应性方面表现不佳。模型在杂乱环境中的物体计数、空间推理偏见以及过度关注物体细节而非空间反馈等方面存在挑战。
Insight: VLMs在寻路辅助中具有潜力,但需通过更好的人类反馈对齐和改进空间推理来提升可用性;研究为开发者提供了整合VLMs到辅助技术中的可行见解,并强调了解决关键限制的重要性。
Abstract: This paper investigates the potential of vision-language models (VLMs) to assist people with blindness and low vision (pBLV) in navigation tasks. We evaluate state-of-the-art closed-source models, including GPT-4V, GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet, alongside open-source models, such as Llava-v1.6-mistral and Llava-onevision-qwen, to analyze their capabilities in foundational visual skills: counting ambient obstacles, relative spatial reasoning, and common-sense wayfinding-pertinent scene understanding. We further assess their performance in navigation scenarios, using pBLV-specific prompts designed to simulate real-world assistance tasks. Our findings reveal notable performance disparities between these models: GPT-4o consistently outperforms others across all tasks, particularly in spatial reasoning and scene understanding. In contrast, open-source models struggle with nuanced reasoning and adaptability in complex environments. Common challenges include difficulties in accurately counting objects in cluttered settings, biases in spatial reasoning, and a tendency to prioritize object details over spatial feedback, limiting their usability for pBLV in navigation tasks. Despite these limitations, VLMs show promise for wayfinding assistance when better aligned with human feedback and equipped with improved spatial reasoning. This research provides actionable insights into the strengths and limitations of current VLMs, guiding developers on effectively integrating VLMs into assistive technologies while addressing key limitations for enhanced usability.
[26] CLRNet: Targetless Extrinsic Calibration for Camera, Lidar and 4D Radar Using Deep Learning cs.CVPDF
Marcell Kegl, Andras Palffy, Csaba Benedek, Dariu M. Gavrila
TL;DR: 本文提出了一种名为CLRNet的新型多模态端到端深度学习标定网络,用于解决相机、激光雷达和4D雷达之间的外参标定问题。该方法能够处理三者联合标定或任意两者之间的配对标定,通过引入等距柱状投影、基于相机的深度图像预测、额外的雷达通道,并利用激光雷达的共享特征空间和闭环损失,显著提升了标定精度。在View-of-Delft和Dual-Radar数据集上的大量实验表明,该方法在标定准确性上优于现有最先进方法。
Details
Motivation: 由于雷达数据稀疏性,其精确外参标定一直是一个挑战,本文旨在解决相机、激光雷达和4D雷达传感器之间的外参标定问题。
Result: 在View-of-Delft和Dual-Radar数据集上的实验表明,该方法相比现有最先进方法,将中位数平移和旋转标定误差均降低了至少50%,达到了SOTA水平。
Insight: 创新点包括:1)提出首个能够联合标定相机、激光雷达和4D雷达的端到端深度学习网络;2)引入等距柱状投影、深度图像预测和额外雷达通道等多模态特征融合策略;3)利用共享特征空间和闭环损失优化标定一致性;4)展示了良好的跨数据集域适应能力。
Abstract: In this paper, we address extrinsic calibration for camera, lidar, and 4D radar sensors. Accurate extrinsic calibration of radar remains a challenge due to the sparsity of its data. We propose CLRNet, a novel, multi-modal end-to-end deep learning (DL) calibration network capable of addressing joint camera-lidar-radar calibration, or pairwise calibration between any two of these sensors. We incorporate equirectangular projection, camera-based depth image prediction, additional radar channels, and leverage lidar with a shared feature space and loop closure loss. In extensive experiments using the View-of-Delft and Dual-Radar datasets, we demonstrate superior calibration accuracy compared to existing state-of-the-art methods, reducing both median translational and rotational calibration errors by at least 50%. Finally, we examine the domain transfer capabilities of the proposed network and baselines, when evaluating across datasets. The code will be made publicly available upon acceptance at: https://github.com/tudelft-iv.
[27] Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory cs.CV | cs.CL | cs.CRPDF
Ce Zhang, Jinxi He, Junyi He, Katia Sycara, Yaqi Xie
TL;DR: 本文提出了MM-SafetyBench++基准,用于评估多模态大语言模型(MLLMs)的上下文安全性,并引入了无需训练的EchoSafe框架,该框架通过自反思记忆库积累和检索安全洞察,在推理时实现上下文感知的安全行为演化。
Details
Motivation: 现有研究主要关注检测和拒绝显式不安全输入的越狱防御,但忽视了上下文安全性,即模型需要区分看似相似但安全意图截然不同的微妙上下文差异。
Result: 在多个多模态安全基准上的广泛实验表明,EchoSafe始终实现卓越性能,为推进MLLMs的上下文安全性建立了强有力的基线。
Insight: 创新点在于构建了专注于上下文安全性的评估基准(MM-SafetyBench++),并提出了一种无需训练、基于记忆库的推理时自反思框架(EchoSafe),以实现安全行为的持续演化。从客观角度看,其将安全评估从显式拒绝扩展到对微妙上下文的区分,并通过记忆机制实现持续学习,是提升模型安全性的一个新颖且实用的方向。
Abstract: Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent. In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image-text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding. Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs. All benchmark data and code are available at https://echosafe-mllm.github.io.
[28] Conflict-Aware Multimodal Fusion for Ambivalence and Hesitancy Recognition cs.CVPDF
Salah Eddine Bekhouche, Hichem Telli, Azeddine Benlamoudi, Salah Eddine Herrouz, Abdelmalik Taleb-Ahmed
TL;DR: 本文提出了一个名为ConflictAwareAH的多模态框架,用于识别矛盾与犹豫(A/H)这种微妙的情感状态。该框架利用预训练的编码器提取视频、音频和文本表示,并通过计算模态嵌入之间的逐元素绝对差异作为成对冲突特征,以此捕捉不同通道间的矛盾信号。此外,还采用了文本引导的后期融合策略来进一步提升性能。
Details
Motivation: 动机在于自动识别矛盾与犹豫(A/H)状态在临床环境中具有重要价值,但由于关键证据存在于语言、声音和面部表情之间的不一致性中,这对机器识别构成了挑战。现有以文本为主导的方法倾向于过度检测A/H,而在确认其不存在时表现不佳。
Result: 在ABAW10挑战赛的BAH数据集上,该方法在标注测试集上达到了0.694的宏平均F1分数,在私有排行榜上达到了0.715,比已发表的多模态基线方法提高了超过10个百分点,且训练时间短(单GPU下少于25分钟)。
Insight: 创新点在于明确地将跨模态冲突(不一致性)作为识别A/H的关键特征,并设计了一种双向利用冲突特征的机制:大的差异标志A/H,小的差异则确认行为一致性以锚定负类。此外,文本引导的后期融合策略有效结合了纯文本辅助头与完整模型,进一步提升了性能。
Abstract: Ambivalence and hesitancy (A/H) are subtle affective states where a person shows conflicting signals through different channels – saying one thing while their face or voice tells another story. Recognising these states automatically is valuable in clinical settings, but it is hard for machines because the key evidence lives in the \emph{disagreements} between what is said, how it sounds, and what the face shows. We present \textbf{ConflictAwareAH}, a multimodal framework built for this problem. Three pre-trained encoders extract video, audio, and text representations. Pairwise conflict features – element-wise absolute differences between modality embeddings – serve as \emph{bidirectional} cues: large cross-modal differences flag A/H, while small differences confirm behavioural consistency and anchor the negative class. This conflict-aware design addresses a key limitation of text-dominant approaches, which tend to over-detect A/H (high F1-AH) while struggling to confirm its absence: our multimodal model improves F1-NoAH by +4.6 points over text alone and halves the class-performance gap. A complementary \emph{text-guided late fusion} strategy blends a text-only auxiliary head with the full model at inference, adding +4.1 Macro F1. On the BAH dataset from the ABAW10 Ambivalence/Hesitancy Challenge, our method reaches \textbf{0.694 Macro F1} on the labelled test split and \textbf{0.715} on the private leaderboard, outperforming published multimodal baselines by over 10 points – all on a single GPU in under 25 minutes of training.
[29] FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding cs.CV | cs.LG | cs.ROPDF
Eadom Dessalene, Botao He, Michael Maynord, Yonatan Tussa, Pavan Mantripragada
TL;DR: 本文介绍了FEEL数据集,这是首个大规模结合力觉测量与第一人称视角视频的数据集,通过定制压阻手套收集厨房环境中约300万帧自然无脚本操作数据,其中45%的帧涉及手-物体接触。论文展示了力觉在物理动作理解中的关键作用,应用于接触理解(时间接触分割和像素级接触物体分割)和动作表示学习(以力预测作为自监督预训练目标),在多个基准上实现了SOTA或竞争性结果。
Details
Motivation: 解决物理动作理解中缺乏力觉数据的问题,因为力是驱动物理交互的根本原因,是理解物理动作的关键基础。
Result: 在时间接触分割任务上达到SOTA;在像素级分割任务上取得竞争性结果,且无需手动标注;在EPIC-Kitchens、SomethingSomething-V2、EgoExo4D和Meccano等动作理解任务上,通过FEEL预训练提升了无标签迁移性能。
Insight: 创新点在于首次构建了大规模力觉-视觉配对数据集,并验证了力觉作为自监督信号对物理动作理解的增强作用;客观分析认为,将力作为物理交互的底层表征,为多模态学习提供了新视角。
Abstract: We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding through application of FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and, (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any need for manual contacted object segmentation annotations. Furthermore we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.
[30] Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers cs.CVPDF
Siyu Zhang
TL;DR: 本文系统评估了Vision Transformers中权重稀疏性与可解释性的关系,发现稀疏模型虽然能产生更紧凑的电路结构,但并未在神经元选择性、特征可解释性或归因忠实性方面带来系统性提升,表明结构稀疏性本身并不能可靠地产生更可解释的视觉模型。
Details
Motivation: 探究结构稀疏性是否本身就能提升语义可解释性,因为稀疏神经网络常被假设比密集模型更可解释,但这一假设在视觉Transformer中缺乏系统验证。
Result: 稀疏模型产生的电路边数比密集模型少约2.5倍,但活跃节点比例相似或更高;在神经元级选择性、稀疏自编码器特征可解释性和归因忠实性方面未显示系统性改进。
Insight: 提出了多级可解释性评估框架IMPACT,涵盖神经元、层表示、任务电路和模型级归因四个互补层级;发现剪枝主要重新分配计算而非隔离更简单的功能模块,强调需要超越电路紧凑性的可解释性评估框架。
Abstract: Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity itself leads to improved semantic interpretability. In this work, we systematically evaluate the relationship between weight sparsity and interpretability in Vision Transformers using DeiT-III B/16 models pruned with Wanda. To assess interpretability comprehensively, we introduce \textbf{IMPACT}, a multi-level framework that evaluates interpretability across four complementary levels: neurons, layer representations, task circuits, and model-level attribution. Layer representations are analyzed using BatchTopK sparse autoencoders, circuits are extracted via learnable node masking, and explanations are evaluated with transformer attribution using insertion and deletion metrics. Our results reveal a clear structural effect but limited interpretability gains. Sparse models produce circuits with approximately $2.5\times$ fewer edges than dense models, yet the fraction of active nodes remains similar or higher, indicating that pruning redistributes computation rather than isolating simpler functional modules. Consistent with this observation, sparse models show no systematic improvements in neuron-level selectivity, SAE feature interpretability, or attribution faithfulness. These findings suggest that structural sparsity alone does not reliably yield more interpretable vision models, highlighting the importance of evaluation frameworks that assess interpretability beyond circuit compactness.
[31] Nodule-Aligned Latent Space Learning with LLM-Driven Multimodal Diffusion for Lung Nodule Progression Prediction cs.CVPDF
James Song, Yifan Wang, Chuan Zhou, Liyue Shen
TL;DR: 本文提出了一种名为NAMD的新型框架,用于预测肺结节进展。该框架通过结合基线CT扫描和患者电子健康记录,生成一年后的随访结节图像,并引入结节对齐的潜在空间和LLM驱动的控制机制。在NLST数据集上,该方法在肺结节恶性预测任务中显著优于基线扫描和现有合成方法,性能接近真实随访扫描。
Details
Motivation: 肺癌早期诊断面临生物不确定性和对结节进展机制理解有限的挑战,需要一种能够准确预测结节进展的方法。
Result: 在NLST数据集上,该方法生成的随访结节图像用于恶性预测的AUROC为0.805,AUPRC为0.346,显著优于基线扫描和SOTA合成方法,接近真实随访扫描的性能(AUROC: 0.819, AUPRC: 0.393)。
Insight: 创新点包括:1)提出结节对齐的潜在空间,使潜在向量距离直接对应结节属性变化;2)利用LLM驱动控制机制,将患者数据作为扩散模型的条件;3)通过生成高质量随访图像来捕获临床相关特征,辅助早期诊断。
Abstract: Early diagnosis of lung cancer is challenging due to biological uncertainty and the limited understanding of the biological mechanisms driving nodule progression. To address this, we propose Nodule-Aligned Multimodal (Latent) Diffusion (NAMD), a novel framework that predicts lung nodule progression by generating 1-year follow-up nodule computed tomography images with baseline scans and the patient’s and nodule’s Electronic Health Record (EHR). NAMD introduces a nodule-aligned latent space, where distances between latents directly correspond to changes in nodule attributes, and utilizes an LLM-driven control mechanism to condition the diffusion backbone on patient data. On the National Lung Screening Trial (NLST) dataset, our method synthesizes follow-up nodule images that achieve an AUROC of 0.805 and an AUPRC of 0.346 for lung nodule malignancy prediction, significantly outperforming both baseline scans and state-of-the-art synthesis methods, while closely approaching the performance of real follow-up scans (AUROC: 0.819, AUPRC: 0.393). These results demonstrate that NAMD captures clinically relevant features of lung nodule progression, facilitating earlier and more accurate diagnosis.
[32] Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation cs.CVPDF
Samuel Johnny, Blessed Guda, Frank Ebeledike, Goodness Obasi, Moise Busogi
TL;DR: 本文提出了一种结合KL正则化组分布鲁棒优化(Group DRO)的框架,用于解决胸部CT扫描自动诊断中的分布偏移(跨采集站点)和性能差异(跨人口亚组)问题。该方法采用轻量级MobileViT-XXS切片编码器和SliceTransformer聚合器进行体积推理,并在两个任务(COVID-19二分类和四类肺病理识别)上验证了其有效性。
Details
Motivation: 解决临床部署中胸部CT自动诊断面临的两个持久挑战:跨采集站点的分布偏移和跨人口亚组(如性别)的性能差异,旨在同时提升模型的鲁棒性和公平性。
Result: 在Task 1(多站点CT体积的COVID-19二分类)上,最佳配置实现了挑战F1分数0.835,比已发表的最佳挑战结果提升了5.9分;在Task 2(基于性别公平约束的四类肺病理识别)上,使用α=0.5的Group DRO实现了平均每性别宏观F1分数0.815,比最佳挑战结果提升了11.1个百分点,并将女性鳞状细胞癌的F1分数相较于Focal Loss基线提升了17.4分,达到了SOTA水平。
Insight: 创新点在于将KL正则化引入Group DRO,防止组权重崩溃,从而在保护最差情况(公平性/鲁棒性)和平均性能之间实现稳定平衡;同时,在任务2中细粒度地定义性别-类别组合组,直接针对代表性严重不足的亚组(如女性鳞状细胞癌)进行优化,提升了模型公平性。
Abstract: Automated diagnosis from chest computed tomography (CT) scans faces two persistent challenges in clinical deployment: distribution shift across acquisition sites and performance disparity across demographic subgroups. We address both simultaneously across two complementary tasks: binary COVID-19 classification from multi-site CT volumes (Task 1) and four-class lung pathology recognition with gender-based fairness constraints (Task 2). Our framework combines a lightweight MobileViT-XXS slice encoder with a two-layer SliceTransformer aggregator for volumetric reasoning, and trains with a KL-regularised Group Distributionally Robust Optimisation (Group DRO) objective that adaptively upweights underperforming acquisition centres and demographic subgroups. Unlike standard Group DRO, the KL penalty prevents group weight collapse, providing a stable balance between worst-case protection and average performance. For Task 2, we define groups at the granularity of gender class, directly targeting severely underrepresented combinations such as female Squamous cell carcinoma. On Task 1, our best configuration achieves a challenge F1 of 0.835, surpassing the best published challenge entry by +5.9. On Task 2, Group DRO with α = 0.5 achieves a mean per-gender macro F1 of 0.815, outperforming the best challenge entry by +11.1 pp and improving Female Squamous F1 by +17.4 over the Fo- cal Loss baseline.
[33] A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Histopathology cs.CVPDF
Harishwar Reddy Kasireddy, Patricio S. La Rosa, Akshita Gupta, Anindya S. Paul, Jamie L. Fermin
TL;DR: 本研究系统评估了11个公开的组织病理学基础模型在11个肾脏特异性下游任务上的表现,涵盖多种染色、空间尺度、任务类型和临床目标。结果表明,这些模型在基于中观尺度形态学的任务上表现中等至良好,但在需要细粒度微结构识别或预后推断的任务上性能下降。
Details
Motivation: 组织病理学基础模型在大规模癌症数据集上预训练后推动了计算病理学发展,但其在非癌性慢性肾脏病中的适用性尚未充分探索,尽管肾脏病理常与恶性肿瘤共存。
Result: 在基于重复分层分组交叉验证(图块级)和重复嵌套分层交叉验证(玻片级)的评估中,模型在诊断分类和显著结构改变检测等中观尺度任务上表现中等至强,而在细粒度微结构区分、复杂生物表型或玻片级预后推断任务上性能持续下降,且与染色类型基本无关。
Insight: 当前组织病理学基础模型主要编码静态的中观尺度表征,在捕捉细微肾脏病理或预后相关信号方面能力有限;研究强调了开发肾脏特异性、多染色和多模态基础模型的必要性,并发布了开源评估工具包kidney-hfm-eval以促进可复现性。
Abstract: Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, H&E, PASM, and IHC), spatial scales (tile and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using Friedman test followed by pairwise Wilcoxon signed-rank testing with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at https://pypi.org/project/kidney-hfm-eval/ , that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.
[34] Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models cs.CV | cs.CL | cs.LGPDF
Sijie Li, Biao Qian, Jungong Han
TL;DR: 本文提出了一种针对大型视觉语言模型(LVLMs)的非对称文本-视觉权重剪枝方法(ATV-Pruning),该方法通过分别分析文本和视觉模态的敏感性,自适应地构建校准池并设计层自适应选择策略,以实现更准确的模型剪枝。
Details
Motivation: 现有LVLM剪枝方法通常以统一方式处理多模态校准数据,忽略了模态特异性行为,导致难以准确处理文本和视觉标记的不同特性。
Result: 在标准多模态基准测试上的大量实验表明,ATV-Pruning方法优于现有最先进(SOTA)的剪枝方法。
Insight: 创新点在于揭示了文本路径比视觉路径对剪枝更敏感,而视觉路径冗余度高;据此提出了非对称剪枝框架,通过自适应构建校准池(包含所有文本标记和部分视觉标记)和层自适应视觉标记选择策略来精确评估权重重要性。
Abstract: Network pruning is an effective technique for enabling lightweight Large Vision-Language Models (LVLMs), which primarily incorporates both weights and activations into the importance metric. However, existing efforts typically process calibration data from different modalities in a unified manner, overlooking modality-specific behaviors. This raises a critical challenge: how to address the divergent behaviors of textual and visual tokens for accurate pruning of LVLMs. To this end, we systematically investigate the sensitivity of visual and textual tokens to the pruning operation by decoupling their corresponding weights, revealing that: (i) the textual pathway should be calibrated via text tokens, since it exhibits higher sensitivity than the visual pathway; (ii) the visual pathway exhibits high redundancy, permitting even 50% sparsity. Motivated by these insights, we propose a simple yet effective Asymmetric Text-Visual Weight Pruning method for LVLMs, dubbed ATV-Pruning, which establishes the importance metric for accurate weight pruning by selecting the informative tokens from both textual and visual pathways. Specifically, ATV-Pruning integrates two primary innovations: first, a calibration pool is adaptively constructed by drawing on all textual tokens and a subset of visual tokens; second, we devise a layer-adaptive selection strategy to yield important visual tokens. Finally, extensive experiments across standard multimodal benchmarks verify the superiority of our ATV-Pruning over state-of-the-art methods.
[35] Speak, Segment, Track, Navigate: An Interactive System for Video-Guided Skull-Base Surgery cs.CVPDF
Jecia Z. Y. Mao, Francis X. Creighton, Russell H. Taylor, Manish Sahu
TL;DR: 本文提出了一种用于视频引导颅底手术的语音驱动实体代理框架,该系统能够根据外科医生的语音指令,在实时术中视频流上动态执行感知与图像引导任务,实现无需额外硬件的光学跟踪与交互式分割功能。
Details
Motivation: 传统图像引导导航系统依赖外部光学跟踪器和额外硬件设置,导致工作流程中断且部署复杂;本研究的动机是开发一种仅基于术中视频流的自然语言交互系统,使外科医生能在不中断手术任务的情况下请求计算辅助。
Result: 在视频引导颅底手术场景中评估,该系统在跟踪性能上与商用光学跟踪系统相当,实现了竞争性的空间精度,同时提升了工作流程整合度并支持视频引导手术系统的快速部署。
Insight: 创新点在于将语音交互、实时视觉感知与术中视频流深度融合,以分割后的手术器械作为空间锚点,自主支持下游工作流(如解剖分割、术前3D模型配准、单目视频工具姿态估计和实时解剖叠加),实现了纯视频驱动的轻量化手术导航框架。
Abstract: We introduce a speech-guided embodied agent framework for video-guided skull base surgery that dynamically executes perception and image-guidance tasks in response to surgeon queries. The proposed system integrates natural language interaction with real-time visual perception directly on live intraoperative video streams, thereby enabling surgeons to request computational assistance without disengaging from operative tasks. Unlike conventional image-guided navigation systems that rely on external optical trackers and additional hardware setup, the framework operates purely on intraoperative video. The system begins with interactive segmentation and labeling of the surgical instrument. The segmented instrument is then used as a spatial anchor that is autonomously tracked in the video stream to support downstream workflows, including anatomical segmentation, interactive registration of preoperative 3D models, monocular video-based estimation of the surgical tool pose, and support image guidance through real-time anatomical overlays.We evaluate the proposed system in video-guided skull base surgery scenarios and benchmark its tracking performance against a commercially available optical tracking system. Results demonstrate that speech-guided embodied agents can achieve competitive spatial accuracy while improving workflow integration and enabling rapid deployment of video-guided surgical systems.
[36] ViT-AdaLA: Adapting Vision Transformers with Linear Attention cs.CVPDF
Yifan Li, Seunghyun Yoon, Viet Dac Lai, Franck Dernoncourt, Jason Kuen
TL;DR: 本文提出ViT-AdaLA框架,旨在将基于Vision Transformer(ViT)的视觉基础模型(VFMs)的先验知识高效地迁移到线性注意力ViT中,以解决其二次复杂度带来的可扩展性问题。该方法通过注意力对齐、特征对齐和监督微调三个阶段,在分类和分割任务上展现出优于现有线性注意力方法的性能。
Details
Motivation: 动机在于解决ViT模型因自注意力机制的二次复杂度而难以扩展到长序列的问题,同时避免从头训练线性注意力ViT所需的大量计算资源,并克服现有为大型语言模型设计的线性化方法在ViT上迁移效果不佳的挑战。
Result: 在分类和分割任务上的大量实验表明,ViT-AdaLA在多个任务上超越了各种最先进的线性注意力方法,证明了其有效性和通用性。
Insight: 创新点在于提出了一种分阶段的知识迁移框架,通过注意力对齐和特征对齐来逐步逼近原始softmax注意力的行为,从而有效利用预训练VFMs的先验知识,避免了从头训练线性注意力模型的巨大开销,并为ViT的线性化提供了一种新的适配思路。
Abstract: Vision Transformers (ViTs) based vision foundation models (VFMs) have achieved remarkable performance across diverse vision tasks, but suffer from quadratic complexity that limits scalability to long sequences. Existing linear attention approaches for ViTs are typically trained from scratch, requiring substantial computational resources, while linearization-based methods developed for large language model decoders do not transfer well to ViTs. To address these challenges, we propose ViT-AdaLA, a novel framework for effectively adapting and transferring prior knowledge from VFMs to linear attention ViTs. ViT-AdaLA consists of three stages: attention alignment, feature alignment, and supervised fine-tuning. In the attention alignment stage, we align vanilla linear attention with the original softmax-based attention in each block to approximate the behavior of softmax attention. However, residual approximation errors inevitably accumulate across layers. We mitigate this by fine-tuning the linearized ViT to align its final-layer features with a frozen softmax VFM teacher. Finally, the adapted prior knowledge is transferred to downstream tasks through supervised fine-tuning. Extensive experiments on classification and segmentation tasks demonstrate the effectiveness and generality of ViT-AdaLA over various state-of-the-art linear attention counterpart.
[37] Attribution Upsampling should Redistribute, Not Interpolate cs.CV | cs.LGPDF
Vincenzo Buono, Peyman Sheikholharam Mashhadi, Mahmoud Rahat, Prayag Tiwari, Stefan Byttner
TL;DR: 本文提出了一种新的归因上采样方法USU,通过将上采样视为质量再分配问题而非插值问题,解决了传统插值方法在归因图中引入伪影的问题。
Details
Motivation: 传统归因方法使用为自然图像设计的插值技术(如双线性/双三次插值)上采样显著图,会导致混叠、振铃和边界渗漏,扭曲模型推理信号。
Result: 在ImageNet、CIFAR-10和CUB-200数据集上的评估表明,USU在忠实性方面持续改进,并产生语义更连贯的定性解释。
Insight: 核心创新是将上采样重新定义为受模型语义边界约束的质量再分配问题,并形式化了四个忠实上采样的公理要求,推导出唯一的比例形式算子USU,可证明地保持归因质量和相对重要性顺序。
Abstract: Attribution methods in explainable AI rely on upsampling techniques that were designed for natural images, not saliency maps. Standard bilinear and bicubic interpolation systematically corrupts attribution signals through aliasing, ringing, and boundary bleeding, producing spurious high-importance regions that misrepresent model reasoning. We identify that the core issue is treating attribution upsampling as an interpolation problem that operates in isolation from the model’s reasoning, rather than a mass redistribution problem where model-derived semantic boundaries must govern how importance flows. We present Universal Semantic-Aware Upsampling (USU), a principled method that reformulates upsampling through ratio-form mass redistribution operators, provably preserving attribution mass and relative importance ordering. Extending the axiomatic tradition of feature attribution to upsampling, we formalize four desiderata for faithful upsampling and prove that interpolation structurally violates three of them. These same three force any redistribution operator into a ratio form; the fourth selects the unique potential within this family, yielding USU. Controlled experiments on models with known attribution priors verify USU’s formal guarantees; evaluation across ImageNet, CIFAR-10, and CUB-200 confirms consistent faithfulness improvements and qualitatively superior, semantically coherent explanations.
[38] Volumetrically Consistent Implicit Atlas Learning via Neural Diffeomorphic Flow for Placenta MRI cs.CV | cs.GRPDF
Athena Taymourtash, S. Mazdak Abulnaga, Esra Abaci Turk, P. Ellen Grant, Polina Golland
TL;DR: 该论文提出了一种体积一致的隐式图谱学习方法,通过神经微分同胚流耦合符号距离函数重建,以学习胎盘的共享规范模板。该方法在胎盘MRI应用中联合重建个体胎盘、对齐到群体隐式模板,并在统一规范空间中实现体素级强度映射。
Details
Motivation: 现有隐式配准方法主要依赖零水平集附近的监督,仅捕获表面对应关系,导致内部变形约束不足,难以建立解剖形状间的密集体积对应关系,这对于群体分析至关重要。
Result: 在活体胎盘MRI扫描上的实验表明,该方法相比基于表面的隐式基线方法,在几何保真度和体积对齐方面均有提升,产生了适合群体分析的解剖可解释且拓扑一致的扁平化结果。
Insight: 创新点在于将符号距离函数重建与神经微分同胚流耦合,并引入雅可比行列式和双调和惩罚等体积正则化项,以抑制局部折叠并促进全局一致的变形,从而实现了体积一致的隐式配准。
Abstract: Establishing dense volumetric correspondences across anatomical shapes is essential for group-level analysis but remains challenging for implicit neural representations. Most existing implicit registration methods rely on supervision near the zero-level set and thus capture only surface correspondences, leaving interior deformations under-constrained. We introduce a volumetrically consistent implicit model that couples reconstruction of signed distance functions (SDFs) with neural diffeomorphic flow to learn a shared canonical template of the placenta. Volumetric regularization, including Jacobian-determinant and biharmonic penalties, suppresses local folding and promotes globally coherent deformations. In the motivating application to placenta MRI, our formulation jointly reconstructs individual placentas, aligns them to a population-derived implicit template, and enables voxel-wise intensity mapping in a unified canonical space. Experiments on in-vivo placenta MRI scans demonstrate improved geometric fidelity and volumetric alignment over surface-based implicit baseline methods, yielding anatomically interpretable and topologically consistent flattening suitable for group analysis.
[39] Interact3D: Compositional 3D Generation of Interactive Objects cs.CV | cs.AIPDF
Hui Shan, Keyang Luo, Ming Li, Sizhe Zheng, Yanwei Fu
TL;DR: 本文提出Interact3D框架,用于从单张图像生成物理上合理的、交互式的3D组合物体。该方法利用生成先验获取高质量独立资产,并通过两阶段组合流程(几何对齐与基于SDF的优化)进行物理组合,最后引入一个由视觉语言模型驱动的闭环自修正策略来迭代优化生成结果,以处理遮挡并保持物体间空间关系。
Details
Motivation: 现有方法从单张图像生成3D组合物体时,尤其在遮挡情况下,难以保持隐藏区域的几何细节和物体间的空间关系(OOR),导致几何质量下降和物理不合理。
Result: 大量实验表明,Interact3D能够成功生成具有前景的、碰撞感知的组合,并提升了几何保真度和空间关系一致性。
Insight: 创新点在于:1)结合生成先验与两阶段物理组合(全局-局部对齐与基于SDF的防穿透优化);2)引入由VLM驱动的闭环、自主(agentic)细化策略,通过分析多视角渲染、生成修正提示来迭代自校正生成流程,增强了系统的鲁棒性和物理合理性。
Abstract: Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images–particularly under occlusions–remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present a novel framework Interact3D designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets with a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometry intersections. To reduce challenging collisions, we further deploy a closed-loop, agentic refinement strategy. A Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D successfully produces promising collsion-aware compositions with improved geometric fidelity and consistent spatial relationships.
[40] Parallel In-context Learning for Large Vision Language Models cs.CV | cs.AI | cs.LGPDF
Shin’ya Yamaguchi, Daiki Chijiwa, Tamao Sakao, Taku Hasegawa
TL;DR: 本文提出了一种名为Parallel-ICL的即插即用推理算法,用于解决大型视觉语言模型(LVLMs)在多模态上下文学习(MM-ICL)中,增加演示示例数量会显著增加推理延迟的问题。该方法通过将长演示上下文分割成多个短块并行处理,并使用基于加权专家乘积(PoE)的集成方法在logit层面融合预测,从而在保持性能的同时大幅提升推理速度。
Details
Motivation: 动机在于解决MM-ICL中性能与效率的权衡问题:增加演示示例能提升模型性能,但由于Transformer注意力机制的计算成本与上下文长度呈二次方关系,会导致推理延迟显著增加。
Result: 在VQA、图像描述和分类等多个基准测试上的广泛实验表明,Parallel-ICL在性能上可与完整上下文的MM-ICL相媲美,同时显著提高了推理速度。
Insight: 创新点在于提出了一个并行化处理长上下文的推理框架,并基于集成学习理论引入了原则性策略:基于聚类的上下文分块以最大化块间多样性,以及基于相似性的上下文编译以根据查询相关性加权预测。这为解决MM-ICL中的精度-效率权衡提供了一个有效方案。
Abstract: Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, they incur significant inference latency due to the quadratic computational cost of Transformer attention with respect to the context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL, while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
[41] LICA: Layered Image Composition Annotations for Graphic Design Research cs.CV | cs.AIPDF
Elad Hirsch, Shubham Yadav, Mohit Garg, Purvanshi Mehta
TL;DR: LICA是一个大规模的多层图形设计数据集,包含1,550,244个设计作品,旨在推动图形布局的结构化理解和生成研究。该数据集不仅提供渲染的PNG图像,还将每个设计表示为包含文本、图像、矢量、组等类型组件的层次化组合,并附有丰富的元素级元数据。此外,LICA还引入了图形设计视频作为视觉语言模型的新挑战,包含27,261个带关键帧和运动参数的动画布局。
Details
Motivation: 当前缺乏能够支持图形设计结构化理解和生成的大规模数据集,现有研究多局限于像素级分析,难以深入理解设计中的层次化组合关系。LICA旨在填补这一空白,为研究设计结构而非仅像素的模型提供数据基础。
Result: LICA数据集规模庞大,涵盖20个设计类别和971,850个独特模板,覆盖了广泛的真实世界设计结构。它为新研究任务(如层感知修复、结构化布局生成、可控设计编辑和时间感知生成建模)建立了基准。
Insight: LICA的创新点在于将设计表示为组合层和关系的系统,支持直接对设计结构进行操作的模型研究。这为图形设计研究提供了新的数据范式,特别是引入了图形设计视频作为视觉语言模型的新挑战领域,推动了从像素级到结构级理解的转变。
Abstract: We introduce LICA (Layered Image Composition Annotations), a large-scale dataset of 1,550,244 multi-layer graphic design compositions designed to advance structured understanding and generation of graphic layouts1. In addition to ren- dered PNG images, LICA represents each design as a hierarchical composition of typed components including text, image, vector, and group elements, each paired with rich per-element metadata such as spatial geometry, typographic attributes, opacity, and visibility. The dataset spans 20 design categories and 971,850 unique templates, providing broad coverage of real-world design structures. We further introduce graphic design video as a new and largely unexplored challenge for current vision-language models through 27,261 animated layouts annotated with per-component keyframes and motion parameters. Beyond scale, LICA establishes a new paradigm of research tasks for graphic design, enabling structured investiga- tions into problems such as layer-aware inpainting, structured layout generation, controlled design editing, and temporally-aware generative modeling. By repre- senting design as a system of compositional layers and relationships, the dataset supports research on models that operate directly on design structure rather than pixels alone.
[42] OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder cs.CVPDF
Sensen Gao, Zhaoqing Wang, Qihang Cao, Dongdong Yu, Changhu Wang
TL;DR: 本文提出了OneWorld框架,通过3D统一表示自编码器(3D-URAE)直接在连贯的3D表示空间中进行扩散,以解决现有基于扩散的3D场景生成方法在2D图像/视频潜在空间中操作时难以保持跨视图外观和几何一致性的问题。
Details
Motivation: 现有基于扩散的3D场景生成方法主要在2D图像/视频潜在空间中操作,导致跨视图外观和几何一致性难以保证,因此需要一种直接在3D表示空间中进行生成的方法。
Result: 综合实验表明,OneWorld生成的3D场景质量高,在跨视图一致性方面优于当前最先进的基于2D的方法。
Insight: 创新点包括:1) 3D统一表示自编码器(3D-URAE),利用预训练的3D基础模型并注入外观和语义蒸馏;2) 令牌级跨视图对应(CVC)一致性损失,显式强制跨视图结构对齐;3) 流形漂移强制(MDF),缓解训练-推理暴露偏差并塑造鲁棒的3D流形。
Abstract: Existing diffusion-based 3D scene generation methods primarily operate in 2D image/video latent spaces, which makes maintaining cross-view appearance and geometric consistency inherently challenging. To bridge this gap, we present OneWorld, a framework that performs diffusion directly within a coherent 3D representation space. Central to our approach is the 3D Unified Representation Autoencoder (3D-URAE); it leverages pretrained 3D foundation models and augments their geometry-centric nature by injecting appearance and distilling semantics into a unified 3D latent space. Furthermore, we introduce token-level Cross-View-Correspondence (CVC) consistency loss to explicitly enforce structural alignment across views, and propose Manifold-Drift Forcing (MDF) to mitigate train-inference exposure bias and shape a robust 3D manifold by mixing drifted and original representations. Comprehensive experiments demonstrate that OneWorld generates high-quality 3D scenes with superior cross-view consistency compared to state-of-the-art 2D-based methods. Our code will be available at https://github.com/SensenGao/OneWorld.
[43] NanoGS: Training-Free Gaussian Splat Simplification cs.CV | cs.GRPDF
Butian Xiong, Rong Liu, Tiantian Zhou, Meida Chen, Zhiwen Fan
TL;DR: NanoGS是一种无需训练、轻量级的3D高斯泼溅简化框架,通过局部成对合并高斯原语来减少模型中的原语数量,从而降低存储和传输成本,同时保持渲染保真度。
Details
Motivation: 解决3D高斯泼溅(3DGS)因使用数百万个原语而导致的高存储和传输开销问题,现有压缩方法依赖GPU密集型后训练优化和校准图像,限制了实际部署。
Result: 实验表明,NanoGS显著减少了原语数量,同时保持了高渲染保真度,提供了高效实用的简化解决方案。
Insight: 创新点在于将简化问题形式化为稀疏空间图上的局部成对合并,使用质量保持矩匹配近似高斯对,并通过原则性合并成本评估合并质量,无需图像渲染监督,可在CPU上高效运行,与现有渲染管道无缝集成。
Abstract: 3D Gaussian Splat (3DGS) enables high-fidelity, real-time novel view synthesis by representing scenes with large sets of anisotropic primitives, but often requires millions of Splats, incurring significant storage and transmission costs. Most existing compression methods rely on GPU-intensive post-training optimization with calibrated images, limiting practical deployment. We introduce NanoGS, a training-free and lightweight framework for Gaussian Splat simplification. Instead of relying on image-based rendering supervision, NanoGS formulates simplification as local pairwise merging over a sparse spatial graph. The method approximates a pair of Gaussians with a single primitive using mass preserved moment matching and evaluates merge quality through a principled merge cost between the original mixture and its approximation. By restricting merge candidates to local neighborhoods and selecting compatible pairs efficiently, NanoGS produces compact Gaussian representations while preserving scene structure and appearance. NanoGS operates directly on existing Gaussian Splat models, runs efficiently on CPU, and preserves the standard 3DGS parameterization, enabling seamless integration with existing rendering pipelines. Experiments demonstrate that NanoGS substantially reduces primitive count while maintaining high rendering fidelity, providing an efficient and practical solution for Gaussian Splat simplification. Our project website is available at https://saliteta.github.io/NanoGS/.
[44] PathGLS: Evaluating Pathology Vision-Language Models without Ground Truth through Multi-Dimensional Consistency cs.CV | cs.AIPDF
Minbing Chen, Zhu Meng, Fei Su
TL;DR: 本文提出了PathGLS,一种无需参考标准(ground truth)的评估框架,用于评估病理学视觉语言模型。该框架通过三个维度——定位(细粒度视觉-文本对齐)、逻辑(使用自然语言推理的蕴含图一致性)和稳定性(对抗性视觉-语义扰动下的输出方差)——来综合评估模型,生成一个全面的信任分数。实验在多个数据集上进行,证明了其有效性。
Details
Motivation: 解决病理学视觉语言模型缺乏可靠、自动化评估指标的问题,特别是为了识别幻觉等细微错误,从而促进其在临床中的广泛采用。
Result: 在Quilt-1M数据集上,PathGLS对幻觉报告的敏感性下降高达40.2%,远优于BERTScore的2.1%。在与专家定义的临床错误层次结构验证中,PathGLS实现了Spearman秩相关ρ=0.71,显著优于基于大语言模型的方法(如Gemini 3.0 Pro的ρ=0.39)。
Insight: 创新点在于提出了一个无需参考标准的、多维度(定位、逻辑、稳定性)的评估框架,能够量化幻觉率和领域偏移鲁棒性,为在私有临床数据集上基准测试VLM和指导安全部署提供了可靠标准。
Abstract: Vision-Language Models (VLMs) offer significant potential in computational pathology by enabling interpretable image analysis, automated reporting, and scalable decision support. However, their widespread clinical adoption remains limited due to the absence of reliable, automated evaluation metrics capable of identifying subtle failures such as hallucinations. To address this gap, we propose PathGLS, a novel reference-free evaluation framework that assesses pathology VLMs across three dimensions: Grounding (fine-grained visual-text alignment), Logic (entailment graph consistency using Natural Language Inference), and Stability (output variance under adversarial visual-semantic perturbations). PathGLS supports both patch-level and whole-slide image (WSI)-level analysis, yielding a comprehensive trust score. Experiments on Quilt-1M, TCGA, REG2025, PathMMU and TCGA-Sarcoma datasets demonstrate the superiority of PathGLS. Specifically, on the Quilt-1M dataset, PathGLS reveals a steep sensitivity drop of 40.2% for hallucinated reports compared to only 2.1% for BERTScore. Moreover, validation against expert-defined clinical error hierarchies reveals that PathGLS achieves a strong Spearman’s rank correlation of $ρ=0.71$ ($p < 0.0001$), significantly outperforming Large Language Model (LLM)-based approaches (Gemini 3.0 Pro: $ρ=0.39$, $p < 0.0001$). These results establish PathGLS as a robust reference-free metric. By directly quantifying hallucination rates and domain shift robustness, it serves as a reliable criterion for benchmarking VLMs on private clinical datasets and informing safe deployment. Code can be found at: https://github.com/My13ad/PathGLS
[45] Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting cs.CVPDF
Da Zhang, Bingyu Li, Feiyu Wang, Zhiyuan Zhao, Junyu Gao
TL;DR: 该论文提出了QICA框架,通过结合数量感知和空间聚合来解决零样本物体计数(ZSOC)中存在的数量感知不足和空间不敏感问题。具体方法包括协同提示策略(SPS)、成本聚合解码器(CAD)和多级数量对齐损失(L_MQA),以增强模型对数量的细粒度理解和跨域泛化能力。
Details
Motivation: 现有零样本物体计数方法通常将计数视为粗粒度检索任务,缺乏细粒度数量感知,并且在模型适应过程中因特征空间扭曲导致空间不敏感和泛化能力下降。
Result: 在FSC-147数据集上取得了有竞争力的性能,在CARPK和ShanghaiTech-A数据集上的零样本评估验证了其对于未见领域的优越泛化能力。
Insight: 创新点包括通过数值条件提示桥接语义识别与数量推理的协同提示策略(SPS),以及直接在视觉-文本相似度图上操作以防止过拟合并保持零样本可迁移性的成本聚合解码器(CAD)。
Abstract: Zero-shot object counting (ZSOC) aims to enumerate objects of arbitrary categories specified by text descriptions without requiring visual exemplars. However, existing methods often treat counting as a coarse retrieval task, suffering from a lack of fine-grained quantity awareness. Furthermore, they frequently exhibit spatial insensitivity and degraded generalization due to feature space distortion during model adaptation.To address these challenges, we present \textbf{QICA}, a novel framework that synergizes \underline{q}uantity percept\underline{i}on with robust spatial \underline{c}ast \underline{a}ggregation. Specifically, we introduce a Synergistic Prompting Strategy (\textbf{SPS}) that adapts vision and language encoders through numerically conditioned prompts, bridging the gap between semantic recognition and quantitative reasoning. To mitigate feature distortion, we propose a Cost Aggregation Decoder (\textbf{CAD}) that operates directly on vision-text similarity maps. By refining these maps through spatial aggregation, CAD prevents overfitting while preserving zero-shot transferability. Additionally, a multi-level quantity alignment loss ($\mathcal{L}_{MQA}$) is employed to enforce numerical consistency across the entire pipeline. Extensive experiments on FSC-147 demonstrate competitive performance, while zero-shot evaluation on CARPK and ShanghaiTech-A validates superior generalization to unseen domains.
[46] Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training cs.CVPDF
Peng Sun, Jun Xie, Tao Lin
TL;DR: 本文提出了一种名为IOMM(Image-Only Training for UMMs)的两阶段高效训练框架,旨在解决统一多模态模型(UMM)中视觉生成组件预训练效率低下且依赖稀缺高质量图文配对数据的问题。该框架首先仅使用大量无标签图像数据进行视觉生成组件的预训练,然后使用少量图文对和未标记图像的混合数据进行微调。实验表明,IOMM不仅提升了训练效率,还在多个基准上达到了最先进的性能。
Details
Motivation: 统一多模态模型(UMMs)的视觉生成组件预训练通常受限于低效的训练范式和稀缺的高质量图文配对数据,这两个问题被识别为主要的性能瓶颈。
Result: 在GenEval基准上达到0.89分,在WISE基准上达到0.55分,超越了BAGEL-7B(0.82 & 0.55)和BLIP3-o-4B(0.84 & 0.50)等强基线模型,实现了最先进的性能。
Insight: 核心创新点是提出了一个两阶段的、数据高效的训练框架,通过先进行仅图像的掩码建模预训练来消除对配对数据的依赖并大幅降低计算成本,再通过少量图文对微调来提升指令对齐和生成质量。这种方法将昂贵的配对数据需求转移到了更轻量的微调阶段,是一种高效且有效的预训练策略。
Abstract: Unified Multimodal Models (UMMs) are often constrained by the pre-training of their $\textbf{visual generation components}$, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for $\textbf{UMM visual generation}$ and identify these two issues as the major bottlenecks. To address them, we propose $\textbf{Image-Only Training for UMMs (IOMM)}$, a data-efficient two-stage training framework. The first stage pre-trains the visual generative component $\textbf{exclusively}$ using abundant unlabeled image-only data, thereby removing the dependency on paired data $\textbf{for this costly phase}$. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only $\sim \textbf{1050}$ H800 GPU hours (with the vast majority, $\textbf{1000}$ hours, dedicated to the efficient $\textbf{image-only pre-training stage}$). It achieves $\textbf{0.89}$ on GenEval and $\textbf{0.55}$ on WISE–surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available $\href{https://github.com/LINs-lab/IOMM}{https://github.com/LINs-lab/IOMM}$.
[47] GATS: Gaussian Aware Temporal Scaling Transformer for Invariant 4D Spatio-Temporal Point Cloud Representation cs.CV | cs.AIPDF
Jiayi Tian, Jiaze Wang
TL;DR: 本文提出了一种名为GATS(高斯感知时序缩放Transformer)的双重不变性框架,用于解决4D点云视频理解中的时序尺度偏差和分布不确定性挑战。该框架包含不确定性引导高斯卷积(UGGC)和时序缩放注意力(TSA)两个互补模块,分别在点卷积中引入局部高斯统计与不确定性门控以增强对密度变化、噪声和遮挡的鲁棒性,并通过可学习的缩放因子归一化时序距离以确保帧率不变性。
Details
Motivation: 现有基于CNN或Transformer的4D点云视频处理方法受限于有限感受野或二次计算复杂度,且忽略了不同帧率下的时序尺度偏差以及不规则点云的分布不确定性,难以设计统一且鲁棒的4D骨干网络。
Result: 在主流基准测试MSR-Action3D(准确率提升+6.62%)、NTU RGBD(准确率提升+1.4%)和Synthia4D(mIoU提升+1.8%)上取得了显著性能提升,相比基于Transformer的方法在准确性、鲁棒性和可扩展性方面表现更优。
Insight: 创新点在于显式地同时处理分布不一致性和时序尺度偏差:UGGC模块将高斯统计与不确定性感知结合到点卷积中,提升对点云不规则性的鲁棒性;TSA模块通过可学习缩放因子实现时序距离归一化,确保帧划分不变性。两个模块互补协作,时序缩放为高斯估计提供归一化时间间隔,而高斯建模增强了分布鲁棒性,为不变性4D点云视频理解提供了更高效、原理性强的范式。
Abstract: Understanding 4D point cloud videos is essential for enabling intelligent agents to perceive dynamic environments. However, temporal scale bias across varying frame rates and distributional uncertainty in irregular point clouds make it highly challenging to design a unified and robust 4D backbone. Existing CNN or Transformer based methods are constrained either by limited receptive fields or by quadratic computational complexity, while neglecting these implicit distortions. To address this problem, we propose a novel dual invariant framework, termed \textbf{Gaussian Aware Temporal Scaling (GATS)}, which explicitly resolves both distributional inconsistencies and temporal. The proposed \emph{Uncertainty Guided Gaussian Convolution (UGGC)} incorporates local Gaussian statistics and uncertainty aware gating into point convolution, thereby achieving robust neighborhood aggregation under density variation, noise, and occlusion. In parallel, the \emph{Temporal Scaling Attention (TSA)} introduces a learnable scaling factor to normalize temporal distances, ensuring frame partition invariance and consistent velocity estimation across different frame rates. These two modules are complementary: temporal scaling normalizes time intervals prior to Gaussian estimation, while Gaussian modeling enhances robustness to irregular distributions. Our experiments on mainstream benchmarks MSR-Action3D (\textbf{+6.62%} accuracy), NTU RGBD (\textbf{+1.4%} accuracy), and Synthia4D (\textbf{+1.8%} mIoU) demonstrate significant performance gains, offering a more efficient and principled paradigm for invariant 4D point cloud video understanding with superior accuracy, robustness, and scalability compared to Transformer based counterparts.
[48] Segmentation-before-Staining Improves Structural Fidelity in Virtual IHC-to-Multiplex IF Translation cs.CVPDF
Junhyeok Lee, Han Jang, Heeseong Eum, Joon Jang, Kyu Sung Choi
TL;DR: 本文提出了一种名为’分割先于染色’的无监督、架构无关的条件化策略,用于改进从虚拟免疫组化(IHC)到多重免疫荧光(mIF)的转换。该方法通过引入预训练细胞核分割基础模型生成的连续细胞概率图作为显式输入先验,并结合一个保持方差的正则化项来匹配局部强度统计,从而在合成荧光通道中保持细胞水平的异质性。实验表明,该方法在多个模型架构和数据集上均能一致地提升细胞核计数保真度和感知质量。
Details
Motivation: 多重免疫荧光(mIF)成本高、流程复杂,限制了其临床应用。虚拟染色技术可以从广泛可用的明场免疫组化(IHC)合成mIF通道,但现有方法主要优化像素级保真度,未明确约束细胞核形态。在病理学中,细胞核数量、形状或空间排列的细微失真会直接影响如Ki67增殖指数等量化终点,导致临床风险分类错误。
Result: 在Pix2Pix(使用U-Net和ResNet生成器)、确定性回归U-Net和条件扩散模型上,于两个独立数据集上进行的受控实验表明,该方法作为唯一修改,在细胞核计数保真度和感知质量方面取得了一致的改进。
Insight: 核心创新点在于将连续细胞概率图作为显式先验输入,这比二值化阈值保留了更丰富的梯度级边界信息,提供了更强的条件化信号,且无需针对特定任务进行调优。同时,方差保持正则化项有助于维持合成通道中的细胞水平异质性。这是一种架构无关的改进策略,可提升虚拟染色中细胞结构(尤其是细胞核)的保真度。
Abstract: Multiplex immunofluorescence (mIF) enables simultaneous single-cell quantification of multiple biomarkers within intact tissue architecture, yet its high reagent cost, multi-round staining protocols, and need for specialized imaging platforms limit routine clinical adoption. Virtual staining can synthesize mIF channels from widely available brightfield immunohistochemistry (IHC), but current translators optimize pixel-level fidelity without explicitly constraining nuclear morphology. In pathology, this gap is clinically consequential: subtle distortions in nuclei count, shape, or spatial arrangement propagate directly to quantification endpoints such as the Ki67 proliferation index, where errors of a few percent can shift treatment-relevant risk categories. This work introduces a supervision-free, architecture-agnostic conditioning strategy that injects a continuous cell probability map from a pretrained nuclei segmentation foundation model as an explicit input prior, together with a variance-preserving regularization term that matches local intensity statistics to maintain cell-level heterogeneity in synthesized fluorescence channels. The soft prior retains gradient-level boundary information lost by binary thresholding, providing a richer conditioning signal without task-specific tuning. Controlled experiments across Pix2Pix with U-Net and ResNet generators, deterministic regression U-Net, and conditional diffusion on two independent datasets demonstrate consistent improvements in nuclei count fidelity and perceptual quality, as the sole modifications. Code will be made publicly available upon acceptance.
[49] 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method cs.CV | cs.AIPDF
Huyen T. T. Tran, Van-Quang Nguyen, Farros Alferro, Kang-Jun Liu, Takayuki Okatani
TL;DR: 本文针对多模态大语言模型(MLLMs)在360°全景图像感知方面的不足,提出了一个名为360Bench的综合性视觉问答(VQA)基准,并开发了一种无需训练的、基于场景图的框架Free360来提升MLLMs在全景图像上的理解和推理能力。
Details
Motivation: MLLMs在常规图像理解上表现出色,但在360°全景图像感知方面研究不足。全景图像能捕捉完整环境,但也带来了几何畸变和复杂空间关系等挑战,需要专门的评估和解决方案。
Result: 在提出的360Bench基准上系统评估了7个MLLMs和6种增强方法,揭示了它们在360°图像感知上的短板。提出的Free360框架能持续提升其基础MLLM的性能,为360° VQA任务提供了一个强大的无需训练的解决方案。
Insight: 论文的创新点在于:1)构建了首个针对高分辨率360°全景图像的综合性VQA基准(360Bench),用于系统评估;2)提出了一种无需训练的模块化框架(Free360),通过自适应球面图像变换和统一图表示来分解并整合推理过程,有效应对全景图像的独特挑战。
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive abilities in understanding and reasoning over conventional images. However, their perception of 360° images remains largely underexplored. Unlike conventional images, 360° images capture the entire surrounding environment, enabling holistic spatial reasoning but introducing challenges such as geometric distortion and complex spatial relations. To comprehensively assess MLLMs’ capabilities to perceive 360° images, we introduce 360Bench, a Visual Question Answering (VQA) benchmark featuring 7K-resolution 360° images, seven representative (sub)tasks with annotations carefully curated by human annotators. Using 360Bench, we systematically evaluate seven MLLMs and six enhancement methods, revealing their shortcomings in 360° image perception. To address these challenges, we propose Free360, a training-free scene-graph-based framework for high-resolution 360° VQA. Free360 decomposes the reasoning process into modular steps, applies adaptive spherical image transformations to 360° images tailored to each step, and seamlessly integrates the resulting information into a unified graph representation for answer generation. Experiments show that Free360 consistently improves its base MLLM and provides a strong training-free solution for 360° VQA tasks. The source code and dataset will be publicly released upon acceptance.
[50] KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety cs.CV | cs.CRPDF
Viraj Panchal, Tanmay Talsaniya, Parag Patel, Meet Patel
TL;DR: 本文提出了KidsNanny,一个用于儿童安全的两阶段多模态内容审核架构。第一阶段结合视觉Transformer和物体检测器进行视觉筛查;第二阶段接收文本信息,应用OCR和基于文本的7B语言模型进行上下文推理。在UnsafeBench数据集上评估,该架构在精度、F1分数和延迟方面均优于基线模型。
Details
Motivation: 解决现有儿童安全内容审核系统在处理依赖文本的威胁时效率低下、延迟高的问题,旨在开发一个高效、低延迟的多模态审核管道。
Result: 在UnsafeBench的Sexual类别(1054张图像)上评估。第一阶段(仅视觉)准确率80.27%,F1分数85.39%,延迟11.7毫秒,优于视觉基线(59.01%-77.04%)。完整两阶段管道准确率81.40%,F1分数86.16%,延迟120毫秒,优于ShieldGemma-2(64.80%准确率,1136毫秒)和LlavaGuard(80.36%准确率,4138毫秒)。在依赖文本的子集上,KidsNanny召回率达100%(小样本),精度75.76%。
Insight: 创新点在于两阶段架构设计,将视觉分类/检测与基于OCR的文本上下文推理解耦,通过传递文本而非原始像素来降低计算开销。这为高效的多模态内容审核提供了新思路,尤其在处理文本嵌入威胁时可能具有召回率和延迟优势。
Abstract: We present KidsNanny, a two-stage multimodal content moderation architecture for child safety. Stage 1 combines a vision transformer (ViT) with an object detector for visual screening (11.7 ms); outputs are routed as text not raw pixels to Stage 2, which applies OCR and a text based 7B language model for contextual reasoning (120 ms total pipeline). We evaluate on the UnsafeBench Sexual category (1,054 images) under two regimes: vision-only, isolating Stage 1, and multimodal, evaluating the full Stage 1+2 pipeline. Stage 1 achieves 80.27% accuracy and 85.39% F1 at 11.7 ms; vision-only baselines range from 59.01% to 77.04% accuracy. The full pipeline achieves 81.40% accuracy and 86.16% F1 at 120 ms, compared to ShieldGemma-2 (64.80% accuracy, 1,136 ms) and LlavaGuard (80.36% accuracy, 4,138 ms). To evaluate text-awareness, we filter two subsets: a text+visual subset (257 images) and a text-only subset (44 images where safety depends primarily on embedded text). On text-only images, KidsNanny achieves 100% recall (25/25 positives; small sample) and 75.76% precision; ShieldGemma-2 achieves 84% recall and 60% precision at 1,136 ms. Results suggest that dedicated OCR-based reasoning may offer recall-precision advantages on text-embedded threats at lower latency, though the small text-only subset limits generalizability. By documenting this architecture and evaluation methodology, we aim to contribute to the broader research effort on efficient multimodal content moderation for child safety.
[51] ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control cs.CVPDF
Haozhe Jia, Jianfei Song, Yuan Zhang, Honglei Jin, Youcheng Fan
TL;DR: 本文提出了ECHO框架,用于实现基于语言指令的人形机器人全身运动控制。该框架采用云边协同架构:云端部署基于扩散模型的文本到运动生成器,从自然语言指令合成运动参考;边缘端部署强化学习跟踪器,在机器人上闭环执行这些运动。两者通过一个紧凑的、机器人原生的38维运动表示进行桥接。
Details
Motivation: 解决如何利用自然语言指令直接、安全、高效地控制人形机器人进行复杂全身运动的问题,避免传统方法中从人体模型重定向到机器人所需的推理时适配,并实现从仿真到真实世界的零硬件微调迁移。
Result: 在基于HumanML3D重定向的基准测试中,在统一的机器人领域评估器下取得了强大的生成质量(FID 0.029, R-Precision Top-1 0.686),同时保持了高运动安全性和轨迹一致性。在Unitree G1人形机器人上的真实世界实验表明,无需硬件微调即可稳定执行多样化的文本指令。
Insight: 创新点包括:1)云边协同的模块化架构设计,分离了高计算需求的运动生成与低延迟的运动执行;2)紧凑的机器人原生运动表示,直接兼容底层控制,消除了重定向步骤;3)采用教师-学生范式与证据适应模块的强化学习跟踪器,增强了仿真到真实的迁移鲁棒性;4)集成了基于IMU的自主跌倒检测与恢复机制。从客观角度看,将扩散模型生成与强化学习跟踪在机器人控制中结合,并通过专用表示和适应模块解决sim-to-real gap,是一个有前景的系统级解决方案。
Abstract: We present ECHO, an edge–cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher–Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.
[52] Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning cs.CVPDF
Haomin Wang, Qi Wei, Qianli Ma, Shengyuan Ding, Jinhui Yin
TL;DR: 本文提出CTRL-S框架,通过引入思维链机制和多奖励强化学习,提升SVG生成模型的结构连贯性、视觉保真度和推理可靠性。
Details
Motivation: 现有SVG生成方法存在泛化能力有限、代码冗余和缺乏显式推理的问题,本文旨在通过结构化推理和多任务多奖励优化解决这些问题。
Result: 在SVG-Sophia数据集上的实验表明,CTRL-S在任务成功率、SVG代码质量和视觉保真度方面优于现有方法,达到SOTA水平。
Insight: 创新点包括引入思维链机制显式暴露推理过程、构建高质量多任务SVG数据集SVG-Sophia,以及采用GRPO算法结合多奖励优化框架提升生成能力。
Abstract: With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model’s reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
[53] S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight cs.CV | cs.ROPDF
Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan
TL;DR: 本文提出S-VAM(Shortcut Video-Action Model),一种通过单次前向传播预测连贯几何和语义表示的快捷视频动作模型,旨在解决现有视频动作模型无法同时保证实时推理和高保真度预测的问题。该方法采用新颖的自蒸馏策略,将多步去噪的结构化生成先验压缩到一步推理中,从而在复杂操作任务中实现高效且精确的机器人操控。
Details
Motivation: 当前视频动作模型(VAMs)通常依赖缓慢的多步视频生成或噪声较大的一步特征提取,无法同时满足实时推理和高保真度预测的需求,限制了其在机器人学习中的应用。
Result: 在仿真和真实世界的大量实验中,S-VAM超越了现有最先进方法,能够在复杂环境中实现高效且精确的操控。
Insight: 创新点在于通过自蒸馏策略,利用从扩散模型自身多步生成视频中提取的视觉基础模型(VFM)表示作为教师目标,训练轻量级解耦器(学生)直接映射噪声一步特征,从而将生成先验压缩到一步推理中,实现了实时高保真预测的平衡。
Abstract: Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion model’s own multi-step generated videos provide teacher targets. Lightweight decouplers, as students, learn to directly map noisy one-step features to these targets. Extensive experiments in simulation and the real world demonstrate that our S-VAM outperforms state-of-the-art methods, enabling efficient and precise manipulation in complex environments. Our project page is https://haodong-yan.github.io/S-VAM/
[54] Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation cs.CVPDF
Yiming Huang, Baixiang Huang, Beilei Cui, Chi Kit Ng, Long Bai
TL;DR: Leveling3D是一个新颖的3D重建与生成流程,它将前馈式3D高斯溅射重建与几何感知的生成模型相结合,旨在同时进行整体性的重建和生成。该方法通过一个轻量级的几何感知适配器,对齐扩散模型的内部知识与前馈模型的几何先验,从而修复由3D表示中约束不足区域导致的渲染伪影和缺失区域。
Details
Motivation: 解决现有基于扩散模型修复前馈3D重建渲染结果的方法缺乏几何关注,在填补外推视角的缺失区域时失败的问题。
Result: 在公开数据集上实现了最先进的性能,包括在新视角合成和深度估计等任务上。
Insight: 核心创新点在于提出了几何感知的适配器,将生成模型的语义知识与重建模型的几何先验对齐;同时引入了调色板过滤训练策略和测试时掩码细化,以提升生成多样性和修复边界质量;更重要的是,该方法形成了一个闭环,增强的外推视图可以反馈给前馈重建模型,从而提升整体3D重建质量。
Abstract: Feed-forward 3D reconstruction has revolutionized 3D vision, providing a powerful baseline for downstream tasks such as novel-view synthesis with 3D Gaussian Splatting. Previous works explore fixing the corrupted rendering results with a diffusion model. However, they lack geometric concern and fail at filling the missing area on the extrapolated view. In this work, we introduce Leveling3D, a novel pipeline that integrates feed-forward 3D reconstruction with geometrical-consistent generation to enable holistic simultaneous reconstruction and generation. We propose a geometry-aware leveling adapter, a lightweight technique that aligns internal knowledge in the diffusion model with the geometry prior from the feed-forward model. The leveling adapter enables generation on the artifact area of the extrapolated novel views caused by underconstrained regions of the 3D representation. Specifically, to learn a more diverse distributed generation, we introduce the palette filtering strategy for training, and a test-time masking refinement to prevent messy boundaries along the fixing regions. More importantly, the enhanced extrapolated novel views from Leveling3D could be used as the inputs for feed-forward 3DGS, leveling up the 3D reconstruction. We achieve SOTA performance on public datasets, including tasks such as novel-view synthesis and depth estimation.
[55] Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting cs.CVPDF
Jiyang Huang, Hongru Cheng, Wei Lin, Jia Wan, Antoni B. Chan
TL;DR: 本文提出了一种用于半监督人群实例分割与计数的排他性引导掩码学习方法。首先,基于最近邻排他圆约束,设计了EDP-SAM模型为现有数据集生成掩码监督。然后,提出了XMask方法,通过一个判别性掩码目标来强制空间分离,并利用高斯平滑和可微分中心采样策略提升特征连续性和训练稳定性。在此基础上,构建了一个半监督人群计数框架,将实例掩码先验作为比传统点标注更丰富的伪标签。
Details
Motivation: 半监督人群分析中,未标注数据丰富且廉价,但传统的基于点的标注因个体区域固有的模糊性而限制了性能,从稀疏标注中学习细粒度结构语义仍是一个未解决的挑战。
Result: 在ShanghaiTech A、UCF-QNRF和JHU++数据集上(使用5%、10%和40%的标注数据)的大量实验验证表明,该端到端模型在半监督分割和计数任务上达到了最先进的性能,有效地在一个统一框架内弥合了计数与实例分割之间的差距。
Insight: 创新点包括:1. 提出基于最近邻排他圆约束的EDP-SAM来生成掩码监督;2. 提出XMask方法,通过判别性掩码目标强制空间分离,并引入高斯平滑和可微分中心采样以提升性能;3. 构建了一个利用实例掩码先验作为伪标签的半监督计数框架,将实例分割与计数任务统一起来。
Abstract: Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse anno tations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.
[56] How to Utilize Complementary Vision-Text Information for 2D Structure Understanding cs.CV | cs.CLPDF
Jiancheng Dong, Pengyue Jia, Derong Xu, Jiawei Cheng, Jingyu Peng
TL;DR: 本文提出DiVA-Former,一种轻量级架构,旨在有效整合视觉和文本信息以提升二维表格结构理解。该方法利用视觉token作为动态查询来提炼长文本序列,从而利用视觉与文本的互补信息。在13个表格基准测试中,模型相比纯文本基线提升了23.9%,并在使用视觉、文本或两者结合的现有基线上取得了一致的增益。
Details
Motivation: 解决LLMs将2D表格线性化为1D序列时削弱行列邻接和布局线索,以及纯视觉编码器难以保留精确单元格文本的问题,旨在有效融合互补的视觉-文本信息。
Result: 在13个表格基准测试上,DiVA-Former相比纯文本基线提升23.9%,并在使用视觉、文本或两者结合的现有基线上均取得一致增益。
Insight: 创新点在于提出利用视觉token作为动态查询来提炼长文本序列的融合方法,有效缓解跨模态干扰并利用互补信息;客观分析认为其轻量级架构设计在模态融合策略上具有借鉴意义。
Abstract: LLMs typically linearize 2D tables into 1D sequences to fit their autoregressive architecture, which weakens row-column adjacency and other layout cues. In contrast, purely visual encoders can capture spatial cues, yet often struggle to preserve exact cell text. Our analysis reveals that these two modalities provide highly distinct information to LLMs and exhibit strong complementarity. However, direct concatenation and other fusion methods yield limited gains and frequently introduce cross-modal interference. To address this issue, we propose DiVA-Former, a lightweight architecture designed to effectively integrate vision and text information. DiVA-Former leverages visual tokens as dynamic queries to distill long textual sequences into digest vectors, thereby effectively exploiting complementary vision–text information. Evaluated across 13 table benchmarks, DiVA-Former improves upon the pure-text baseline by 23.9% and achieves consistent gains over existing baselines using visual inputs, textual inputs, or a combination of both.
[57] Visual Prompt Discovery via Semantic Exploration cs.CV | cs.AIPDF
Jaechang Kim, Yotaro Shimose, Zhao Wang, Kuang-Da Wang, Jungseul Ok
TL;DR: 本文提出了一种名为SEVEX的自动化语义探索框架,用于为大语言视觉模型(LVLM)发现任务特定的视觉提示,以解决其图像理解和视觉推理中的感知失败问题。该方法通过代理驱动的实验,在抽象概念空间中进行高效探索,避免了冗长低级代码的干扰和巨大搜索空间的挑战。
Details
Motivation: LVLM在图像理解和视觉推理中存在显著的感知失败问题,而现有视觉提示生成方法侧重于工具选择,未能诊断和缓解其根本原因,且依赖低效的人工试错来发现最优提示。
Result: 在评估LVLM感知能力的BlindTest和BLINK基准测试上,SEVEX在任务准确性、推理效率、探索效率和探索稳定性方面显著优于基线方法。
Insight: 创新点在于提出了一个自动化、任务导向的视觉提示发现框架,其核心是引入抽象概念空间作为搜索空间,结合新颖性引导的选择算法和语义反馈驱动的构思过程,能够发现超越传统工具使用的、复杂且反直觉的视觉策略。
Abstract: LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. While emerged as a promising direction, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error. We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results. We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts.
[58] Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models cs.CV | cs.AIPDF
Junxin Wang, Dai Guan, Weijie Qiu, Zhihang Li, Yongbo Gai
TL;DR: 本文提出了一种名为显式视觉前提验证(EVPV)的轻量级验证接口,用于提升视觉语言过程奖励模型(VL-PRMs)的可靠性。该方法通过让策略生成步骤级的视觉检查清单来显式化所需的视觉事实,同时从输入图像中独立提取结构化视觉约束,并将清单声明与约束进行匹配以计算视觉可靠性信号,进而通过可靠性门控来校准PRM的步骤奖励,从而在无需每步调用工具的情况下,解耦感知不确定性与逻辑评估。
Details
Motivation: 当前视觉语言过程奖励模型(VL-PRMs)在评分中间推理步骤时,常作为黑盒评判器,其低分可能源于真实的推理错误或仅仅是验证器对图像的误判。这种感知与推理的纠缠导致了系统性误报(奖励幻觉的视觉前提)和漏报(惩罚正确的接地陈述),削弱了重排序和错误定位的可靠性。
Result: 在VisualProcessBench和六个多模态推理基准测试上的实验表明,EVPV改善了步骤级验证,并持续提升了Best-of-N重排序的准确率,优于强基线模型。此外,通过向提取的约束中注入受控的损坏,观察到了单调的性能下降,这为性能提升源于约束保真度和显式前提验证而非偶然的提示效应提供了因果证据。
Insight: 核心创新在于通过显式化视觉前提并独立验证其可靠性,将感知不确定性从逻辑评估中解耦出来。具体机制是引入一个轻量级的视觉检查清单和约束匹配流程,并通过可靠性门控来校准奖励,这为构建更可靠、可解释的多模态推理评估系统提供了新思路。
Abstract: Vision-language process reward models (VL-PRMs) are increasingly used to score intermediate reasoning steps and rerank candidates under test-time scaling. However, they often function as black-box judges: a low step score may reflect a genuine reasoning mistake or simply the verifier’s misperception of the image. This entanglement between perception and reasoning leads to systematic false positives (rewarding hallucinated visual premises) and false negatives (penalizing correct grounded statements), undermining both reranking and error localization. We introduce Explicit Visual Premise Verification (EVPV), a lightweight verification interface that conditions step scoring on the reliability of the visual premises a step depends on. The policy is prompted to produce a step-wise visual checklist that makes required visual facts explicit, while a constraint extractor independently derives structured visual constraints from the input image. EVPV matches checklist claims against these constraints to compute a scalar visual reliability signal, and calibrates PRM step rewards via reliability gating: rewards for visually dependent steps are attenuated when reliability is low and preserved when reliability is high. This decouples perceptual uncertainty from logical evaluation without per-step tool calls. Experiments on VisualProcessBench and six multimodal reasoning benchmarks show that EVPV improves step-level verification and consistently boosts Best-of-N reranking accuracy over strong baselines. Furthermore, injecting controlled corruption into the extracted constraints produces monotonic performance degradation, providing causal evidence that the gains arise from constraint fidelity and explicit premise verification rather than incidental prompt effects. Code is available at: https://github.com/Qwen-Applications/EVPV-PRM
[59] When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition cs.CVPDF
Xiaokun Sun, Yubo Wang, Haoyu Cao, Linli Xu
TL;DR: 本文提出FrameRepeat框架,通过轻量级的重复评分模块和Add-One-In训练策略,使视频多模态大语言模型能够自动识别并强化关键视频帧,以缓解视频问答任务中因视觉锚点漂移导致的性能下降问题。
Details
Motivation: 解决视频问答任务中多模态大语言模型因长链思维推理过程导致的视觉锚点漂移问题,即模型过度依赖自生成文本而忽视视觉输入,从而引发幻觉和性能退化。
Result: 在多个模型和数据集上的实验表明,FrameRepeat能有效且泛化地增强推理过程中的重要视觉线索,提升视频问答性能。
Insight: 创新点在于提出无需昂贵训练成本的自动化框架,通过可学习的帧评分网络指导帧重复,利用模型输出概率生成监督信号,实现轻量且可迁移的视觉遗忘缓解方案。
Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to ``visual anchor drifting’’, where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically introduce specific mechanisms for the model to re-attend to visual inputs during inference, these approaches often incur prohibitive training costs and suffer from poor generalizability across different architectures. To address this, we propose FrameRepeat, an automated enhancement framework which features a lightweight repeat scoring module that enables Video-LLMs to autonomously identify which frames should be reinforced. We introduce a novel training strategy, Add-One-In (AOI), that uses MLLM output probabilities to generate supervision signals representing repeat gain. This can be used to train a frame scoring network, which guides the frame repetition behavior. Experimental results across multiple models and datasets demonstrate that FrameRepeat is both effective and generalizable in strengthening important visual cues during the reasoning process.
[60] Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection cs.CVPDF
Weihua Gao, Wenlong Niu, Jie Tang, Man Yang, Jiafeng Zhang
TL;DR: 本文提出Point-to-Mask框架,用于红外小目标检测(IRSTD),旨在解决现有像素级分割方法标注成本高且不适用于弱纹理、边界模糊的小目标的问题。该框架通过物理驱动的自适应掩码生成模块将低成本点标注转换为紧凑目标掩码和几何线索,并结合轻量级半径感知点回归网络,将检测任务重新定义为目标中心定位和有效半径回归。
Details
Motivation: 现有IRSTD方法通常依赖像素级分割,需要密集标注且对纹理弱、边界模糊的小目标效果不佳,因此需要一种能利用低成本点标注实现掩码级检测的方法。
Result: 实验表明,该框架在点监督设置下实现了高质量的伪标签、高检测精度和高效推理,性能接近全监督水平,并在新构建的SIRSTD-Pixel序列数据集上进行了系统评估。
Insight: 创新点在于将点标注与掩码级检测桥接,通过物理驱动模块生成伪掩码和几何监督,以及利用时空运动线索进行中心定位和半径回归的轻量网络设计,形成训练与推理的闭环,显著降低了标注成本。
Abstract: Infrared small target detection (IRSTD) methods predominantly formulate the task as pixel-level segmentation, which requires costly dense annotations and is not well suited to tiny targets with weak texture and ambiguous boundaries. To address this issue, we propose Point-to-Mask, a framework that bridges low-cost point supervision and mask-level detection through two components: a Physics-driven Adaptive Mask Generation (PAMG) module that converts point annotations into compact target masks and geometric cues, and a lightweight Radius-aware Point Regression Network (RPR-Net) that reformulates IRSTD as target center localization and effective radius regression using spatiotemporal motion cues. The two modules form a closed loop: PAMG generates pseudo masks and geometric supervision during training, while the geometric predictions of RPR-Net are fed back to PAMG for pixel-level mask recovery during inference. To facilitate systematic evaluation, we further construct SIRSTD-Pixel, a sequential dataset with refined pixel-level annotations. Experiments show that the proposed framework achieves strong pseudo-label quality, high detection accuracy, and efficient inference, approaching full-supervision performance under point-supervised settings with substantially lower annotation cost. Code and datasets will be available at: https://github.com/GaoScience/point-to-mask.
[61] AW-MoE: All-Weather Mixture of Experts for Robust Multi-Modal 3D Object Detection cs.CV | cs.AIPDF
Hongwei Lin, Xun Huang, Chenglu Wen, Cheng Wang
TL;DR: AW-MoE是一个用于鲁棒多模态3D目标检测的框架,旨在解决自动驾驶在恶劣天气条件下的性能下降问题。它通过集成混合专家(MoE)模型,利用图像引导的天气感知路由(IWR)来精确分类天气并选择最相关的天气特定专家(WSE),同时提出统一双模态增强(UDMA)来同步增强LiDAR和4D雷达数据。
Details
Motivation: 现有方法在训练时通常简单合并所有天气样本,忽略了不同天气场景下的数据分布差异,导致性能冲突。AW-MoE旨在通过专门处理天气差异的专家模型来提升检测的鲁棒性。
Result: 在真实世界数据集上的实验表明,AW-MoE在恶劣天气条件下的性能比现有最优方法提升了约15%,且推理开销可忽略。将其集成到现有基线检测器中也能超越当前SOTA方法。
Insight: 创新点包括:1) 将MoE引入天气鲁棒的多模态3D检测,利用图像特征的判别性和不变性进行天气感知路由;2) 提出统一双模态增强方法,同步增强LiDAR和雷达数据并保持场景真实性。这提供了处理多天气数据分布差异的有效架构和增强策略。
Abstract: Robust 3D object detection under adverse weather conditions is crucial for autonomous driving. However, most existing methods simply combine all weather samples for training while overlooking data distribution discrepancies across different weather scenarios, leading to performance conflicts. To address this issue, we introduce AW-MoE, the framework that innovatively integrates Mixture of Experts (MoE) into weather-robust multi-modal 3D object detection approaches. AW-MoE incorporates Image-guided Weather-aware Routing (IWR), which leverages the superior discriminability of image features across weather conditions and their invariance to scene variations for precise weather classification. Based on this accurate classification, IWR selects the top-K most relevant Weather-Specific Experts (WSE) that handle data discrepancies, ensuring optimal detection under all weather conditions. Additionally, we propose a Unified Dual-Modal Augmentation (UDMA) for synchronous LiDAR and 4D Radar dual-modal data augmentation while preserving the realism of scenes. Extensive experiments on the real-world dataset demonstrate that AW-MoE achieves ~ 15% improvement in adverse-weather performance over state-of-the-art methods, while incurring negligible inference overhead. Moreover, integrating AW-MoE into established baseline detectors yields performance improvements surpassing current state-of-the-art methods. These results show the effectiveness and strong scalability of our AW-MoE. We will release the code publicly at https://github.com/windlinsherlock/AW-MoE.
[62] FG-SGL: Fine-Grained Semantic Guidance Learning via Motion Process Decomposition for Micro-Gesture Recognition cs.CVPDF
Jinsheng Wei, Zhaodi Xu, Guanming Lu, Haoyu Chen, Jingjie Yan
TL;DR: 本文提出了一种名为FG-SGL的细粒度语义引导学习框架,通过运动过程分解来解决微手势识别中因类间差异细微而导致的挑战。该框架联合利用细粒度和类别级语义来指导视觉-语言模型感知局部微手势运动,并设计了多级对比优化策略进行联合训练。
Details
Motivation: 现有微手势识别方法依赖类别级监督,难以捕捉细微且局部的运动差异,因此需要引入更精细的语义指导来提升模型对局部运动特征的感知能力。
Result: 实验表明FG-SGL取得了有竞争力的性能,验证了细粒度语义引导对于微手势识别的有效性。
Insight: 创新点在于构建了包含四个细化语义维度的人工标注细粒度文本数据集来描述微手势的动态过程,并提出了结合细粒度局部运动引导(FG-SA)和类别级特征可分离性增强(CP-A)的双模块框架,通过从粗到细的多级对比优化实现联合学习。
Abstract: Micro-gesture recognition (MGR) is challenging due to subtle inter-class variations. Existing methods rely on category-level supervision, which is insufficient for capturing subtle and localized motion differences. Thus, this paper proposes a Fine-Grained Semantic Guidance Learning (FG-SGL) framework that jointly integrates fine-grained and category-level semantics to guide vision–language models in perceiving local MG motions. FG-SA adopts fine-grained semantic cues to guide the learning of local motion features, while CP-A enhances the separability of MG features through category-level semantic guidance. To support fine-grained semantic guidance, this work constructs a fine-grained textual dataset with human annotations that describes the dynamic process of MGs in four refined semantic dimensions. Furthermore, a Multi-Level Contrastive Optimization strategy is designed to jointly optimize both modules in a coarse-to-fine pattern. Experiments show that FG-SGL achieves competitive performance, validating the effectiveness of fine-grained semantic guidance for MGR.
[63] VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment cs.CVPDF
Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang
TL;DR: 该论文提出了一种名为VIGOR的几何导向奖励模型,用于解决视频扩散模型在生成视频时出现的几何不一致问题,如物体变形、空间漂移和深度违规。该方法利用预训练的几何基础模型,通过跨帧重投影误差评估多视图一致性,并引入几何感知采样策略以提升鲁棒性。通过两种互补的路径(微调或强化学习后训练,以及推理时优化)对齐视频扩散模型,实验验证了其有效性。
Details
Motivation: 视频扩散模型在训练过程中缺乏显式的几何监督,导致生成视频中出现几何不一致的伪影,如物体变形、空间漂移和深度违规,需要一种基于几何的评估和优化方法来缓解这些问题。
Result: 实验结果表明,所提出的几何奖励模型相比其他变体具有更优的鲁棒性,并通过高效的推理时缩放,为增强开源视频模型提供了实用解决方案,无需大量计算资源进行重新训练。
Insight: 创新点在于:1) 在点级别而非像素空间计算跨帧重投影误差,减少了像素强度引入的噪声,得到更物理基础和鲁棒的误差度量;2) 引入几何感知采样策略,过滤低纹理和非语义区域,专注于几何意义明确且对应关系可靠的区域进行评估;3) 通过后训练和推理时优化两种互补路径对齐视频扩散模型,实现了资源高效的模型增强。
Abstract: Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
[64] Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation cs.CV | cs.LGPDF
TianTian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang
TL;DR: 本文提出了一种名为Locate-Then-Sparsify for Feature Steering (LTS-FS)的即插即用框架,用于缓解大型视觉语言模型(LVLMs)中的幻觉问题。该方法通过因果干预归因技术量化各层与幻觉的相关性,并据此对相关层进行差异化的特征引导,从而在降低幻觉的同时保持模型在通用任务上的性能。
Details
Motivation: 现有缓解LVLM幻觉的特征引导方法对所有层采用统一的调整策略,忽略了层间差异,可能干扰与幻觉无关的层,导致通用任务性能下降。本文旨在开发一种更精细的、基于层间相关性评估的引导策略。
Result: 在多个LVLM和基准测试上的广泛实验表明,LTS-FS框架能有效缓解幻觉,同时保持强大的性能。
Insight: 主要创新点在于提出了一个两阶段框架:首先通过构建合成数据集和基于因果干预的归因方法,定位与幻觉高度相关的网络层;然后将归因分数转化为各层的差异化引导强度,实现更精准、稀疏的干预。这为模型编辑和幻觉缓解提供了一种可解释的、层感知的优化思路。
Abstract: Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose a plug-and-play framework called Locate-Then-Sparsify for Feature Steering (LTS-FS), which controls the steering intensity according to the hallucination relevance of each layer. We first construct a synthetic dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that our LTS-FS framework effectively mitigates hallucination while preserving strong performance.
[65] Persistent Story World Simulation with Continuous Character Customization cs.CVPDF
Jinlu Zhang, Qiyun Wang, Baoxiang Du, Jiayi Ji, Jing He
TL;DR: 本文提出EverTale,一个用于连续故事角色定制的故事世界模拟器。它通过统一的LoRA模块实现持续的角色适应,利用MLLM作为评判者确保角色保真度,并采用角色感知区域聚焦采样策略解决多角色生成中的身份退化和布局冲突问题。
Details
Motivation: 当前故事可视化方法难以在精确角色定制、语义对齐和新身份连续集成之间实现协同。本文旨在解决这一挑战,实现持续的角色定制和高质量的多角色故事生成。
Result: 实验结果表明,EverTale在单角色和多角色故事可视化任务上,相较于更广泛的对比方法,均取得了优越的性能。
Insight: 创新点包括:1)通过统一的All-in-One-World Character Integrator模块实现持续角色适应,避免了先前方法中每个角色都需要优化模块的问题;2)引入基于MLLM的Character Quality Gate进行链式思维推理,确保角色适应过程的保真度;3)提出Character-Aware Region-Focus Sampling策略,通过协调局部角色细节与全局场景上下文,高效解决身份退化和布局冲突。
Abstract: Story visualization has gained increasing attention in computer vision. However, current methods often fail to achieve a synergy between accurate character customization, semantic alignment, and continuous integration of new identities. To tackle this challenge, in this paper we present EverTale, a story world simulator for continuous story character customization. We first propose an All-in-One-World Character Integrator to achieve continuous character adaptation within unified LoRA module, eliminating the need for per-character optimization modules of previous methods. Then, we incorporate a Character Quality Gate via MLLM-as-Judge to ensure the fidelity of each character adaptation process through chain-of-thought reasoning, determining whether the model can proceed to the next character or require additional training on the current one. We also introduce a Character-Aware Region-Focus Sampling strategy to address the identity degradation and layout conflicts in existing multi-character visual storytelling, ensuring natural multi-character generation by harmonizing local character-specific details with global scene context with higher efficiency. Experimental results show that our EverTale achieves superior performance against a wider range of compared methods on both single- and multi-character story visualization. Codes will be available.
[66] VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents cs.CV | cs.AIPDF
Zhengbo Zhang, Jinbo Su, Zhaowen Zhou, Changtao Miao, Yuhan Hong
TL;DR: 该论文提出了一个名为VisBrowse-Bench的新基准,用于评估多模态浏览代理的视觉原生搜索能力。该基准包含169个跨多个领域的视觉问答实例,通过文本-图像检索和多模态证据联合推理来评估模型在搜索过程中的视觉推理能力。论文还提出了一种代理工作流程,并评估了开源和闭源模型,结果显示现有模型性能有限。
Details
Motivation: 现有基准存在两个局限性:对视觉推理能力的评估不足,以及在推理链中忽视了网页的原生视觉信息。为了解决这些问题,作者旨在创建一个更全面的基准来评估多模态浏览代理的视觉原生搜索能力。
Result: 在提出的工作流程中,性能最好的模型Claude-4.6-Opus的准确率仅为47.6%,而专有的Deep Research模型o3-deep-research的准确率为41.1%,表明现有模型在该任务上仍有很大提升空间。
Insight: 论文的创新点在于构建了一个专注于评估视觉原生搜索和推理能力的基准,并通过多模态证据交叉验证(文本-图像检索和联合推理)来确保评估的严谨性。其提出的代理工作流程可以驱动代理主动收集和推理视觉信息,为未来多模态代理的研究提供了新的评估框架和方向。
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has enabled browsing agents to acquire and reason over multimodal information in the real world. But existing benchmarks suffer from two limitations: insufficient evaluation of visual reasoning ability and the neglect of native visual information of web pages in the reasoning chains. To address these challenges, we introduce a new benchmark for visual-native search, VisBrowse-Bench. It contains 169 VQA instances covering multiple domains and evaluates the models’ visual reasoning capabilities during the search process through multimodal evidence cross-validation via text-image retrieval and joint reasoning. These data were constructed by human experts using a multi-stage pipeline and underwent rigorous manual verification. We additionally propose an agent workflow that can effectively drive the browsing agent to actively collect and reason over visual information during the search process. We comprehensively evaluated both open-source and closed-source models in this workflow. Experimental results show that even the best-performing model, Claude-4.6-Opus only achieves an accuracy of 47.6%, while the proprietary Deep Research model, o3-deep-research only achieves an accuracy of 41.1%. The code and data can be accessed at: https://github.com/ZhengboZhang/VisBrowse-Bench
[67] Micro-AU CLIP: Fine-Grained Contrastive Learning from Local Independence to Global Dependency for Micro-Expression Action Unit Detection cs.CVPDF
Jinsheng Wei, Fengzhou Guo, Yante Li, Haoyu Chen, Guanming Lu
TL;DR: 本文提出了一种名为Micro-AU CLIP的新框架,用于微表情动作单元检测。该框架将检测过程分解为局部语义独立性建模和全局语义依赖性建模,通过设计的注意力机制和对比损失,实现了对微表情动作单元的细粒度特征学习,并可直接应用于无情感标签的微表情识别。
Details
Motivation: 现有微表情动作单元检测方法通常从整个面部图像/视频中学习特征,这与动作单元固有的局部性相冲突,导致对动作单元区域的感知不足。本文旨在探索从局部独立性到全局依赖性的模式,以更有效地建模微表情动作单元。
Result: 实验结果表明,Micro-AU CLIP能够充分学习细粒度的微表情动作单元特征,在相关基准测试上取得了最先进的性能。
Insight: 创新点在于将动作单元检测过程明确分解为局部独立性建模和全局依赖性建模两个阶段,并设计了相应的注意力机制(PTA, GDA)、损失函数(GDLoss, MiAUCL)来分别实现这两个目标。这为细粒度视觉-文本对齐和依赖关系建模提供了新思路,并可推广到其他需要局部细粒度特征与全局上下文结合的任务中。
Abstract: Micro-expression (ME) action units (Micro-AUs) provide objective clues for fine-grained genuine emotion analysis. Most existing Micro-AU detection methods learn AU features from the whole facial image/video, which conflicts with the inherent locality of AU, resulting in insufficient perception of AU regions. In fact, each AU independently corresponds to specific localized facial muscle movements (local independence), while there is an inherent dependency between some AUs under specific emotional states (global dependency). Thus, this paper explores the effectiveness of the independence-to-dependency pattern and proposes a novel micro-AU detection framework, micro-AU CLIP, that uniquely decomposes the AU detection process into local semantic independence modeling (LSI) and global semantic dependency (GSD) modeling. In LSI, Patch Token Attention (PTA) is designed, mapping several local features within the AU region to the same feature space; In GSD, Global Dependency Attention (GDA) and Global Dependency Loss (GDLoss) are presented to model the global dependency relationships between different AUs, thereby enhancing each AU feature. Furthermore, considering CLIP’s native limitations in micro-semantic alignment, a microAU contrastive loss (MiAUCL) is designed to learn AU features by a fine-grained alignment of visual and text features. Also, Micro-AU CLIP is effectively applied to ME recognition in an emotion-label-free way. The experimental results demonstrate that Micro-AU CLIP can fully learn fine-grained micro-AU features, achieving state-of-the-art performance.
[68] SpikeCLR: Contrastive Self-Supervised Learning for Few-Shot Event-Based Vision using Spiking Neural Networks cs.CVPDF
Maxime Vaillant, Axel Carlier, Lai Xing Ng, Christophe Hurter, Benoit R. Cottereau
TL;DR: 本文提出了SpikeCLR,一种用于脉冲神经网络(SNNs)的对比自监督学习框架,旨在从无标签的事件数据中学习鲁棒的视觉表示。该框架通过代理梯度训练将基于帧的方法适配到脉冲域,并引入了一套利用空间、时间和极性变换的事件特定数据增强方法。
Details
Motivation: 事件视觉传感器具有高速感知优势,但与SNNs结合时,其潜力受限于大规模标注数据集的稀缺,难以有效训练模型。
Result: 在CIFAR10-DVS、N-Caltech101、N-MNIST和DVS-Gesture基准测试中,自监督预训练结合后续微调在低数据量情况下优于监督学习,在少样本和半监督设置中取得了一致的性能提升。
Insight: 创新点包括将对比自监督学习扩展到脉冲神经网络领域,并设计事件特定的时空增强策略;客观分析表明,结合空间和时间增强对于学习事件数据中的时空不变性至关重要,且学到的表示具有跨数据集的可迁移性,有助于在标签稀缺环境下构建强大的事件驱动模型。
Abstract: Event-based vision sensors provide significant advantages for high-speed perception, including microsecond temporal resolution, high dynamic range, and low power consumption. When combined with Spiking Neural Networks (SNNs), they can be deployed on neuromorphic hardware, enabling energy-efficient applications on embedded systems. However, this potential is severely limited by the scarcity of large-scale labeled datasets required to effectively train such models. In this work, we introduce SpikeCLR, a contrastive self-supervised learning framework that enables SNNs to learn robust visual representations from unlabeled event data. We adapt prior frame-based methods to the spiking domain using surrogate gradient training and introduce a suite of event-specific augmentations that leverage spatial, temporal, and polarity transformations. Through extensive experiments on CIFAR10-DVS, N-Caltech101, N-MNIST, and DVS-Gesture benchmarks, we demonstrate that self-supervised pretraining with subsequent fine-tuning outperforms supervised learning in low-data regimes, achieving consistent gains in few-shot and semi-supervised settings. Our ablation studies reveal that combining spatial and temporal augmentations is critical for learning effective spatio-temporal invariances in event data. We further show that learned representations transfer across datasets, contributing to efforts for powerful event-based models in label-scarce settings.
[69] $D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation cs.CV | cs.AIPDF
Ruizhi Wang, Weihan Li, Zunlei Feng, Haofei Zhang, Mingli Song
TL;DR: 本文提出了一种名为$D^3$-RSMDE的高效遥感单目深度估计框架,旨在解决现有方法在精度与效率之间的权衡问题。该框架首先利用基于Vision Transformer的模块快速生成高质量的初步深度图作为结构先验,然后通过一种名为渐进式线性混合细化的策略,在变分自编码器支持的紧凑潜在空间中,使用轻量级U-Net进行少量迭代的细节优化。
Details
Motivation: 现有遥感单目深度估计方法面临准确性与效率之间的显著权衡:基于Vision Transformer的方法速度快但感知质量差,而扩散模型质量高但计算成本巨大。本文旨在克服这一局限,实现速度与质量的最佳平衡。
Result: 大量实验表明,$D^3$-RSMDE在感知指标LPIPS上比领先模型(如Marigold)降低了11.85%,同时推理速度提升了40倍以上,并且保持了与轻量级ViT模型相当的VRAM使用量。
Insight: 主要创新点在于:1) 使用ViT快速生成高质量结构先验以替代扩散模型耗时的初始结构生成阶段;2) 提出渐进式线性混合细化策略,在紧凑潜在空间中进行高效细节优化。从客观角度看,其核心创新是将快速结构生成与高效细节扩散相结合,有效解耦了结构估计与细节增强,从而在保持高保真度的同时实现了显著的加速。
Abstract: Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Although using Vision Transformer (ViT) backbones for dense prediction is fast, they often exhibit poor perceptual quality. Conversely, diffusion models offer high fidelity but at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation ($D^3$-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map construction, which serves as a structural prior, effectively replacing the time-consuming initial structure generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine the details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that $D^3$-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models like Marigold, while also achieving over a 40x speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.
[70] InViC: Intent-aware Visual Cues for Medical Visual Question Answering cs.CVPDF
Zhisong Wang, Ziyang Chen, Zanting Ye, Hongze Zhu, Yefeng Zheng
TL;DR: 本文提出了一种轻量级插件框架InViC(Intent-aware Visual Cues),旨在通过提取问题相关的视觉提示令牌,并采用两阶段微调策略,来增强多模态大语言模型在医学视觉问答任务中对视觉证据的关注,从而减少其依赖语言先验的捷径回答行为,提升临床可靠性。
Details
Motivation: 现有医学视觉问答中的多模态大语言模型常表现出捷径回答,即过度依赖语言先验或数据集偏差,而对图像中的视觉证据关注不足,这尤其影响对细微影像发现的判断,损害了临床可靠性。
Result: 在三个公开的医学视觉问答基准(VQA-RAD、SLAKE和ImageCLEF VQA-Med 2019)上,InViC在多个代表性MLLMs上均一致优于零样本推理和标准的LoRA微调,证明了其有效性。
Insight: 核心创新点在于引入了问题条件化的紧凑视觉提示令牌作为结构化视觉中介,并设计了包含提示瓶颈注意力掩码的两阶段微调策略,强制模型通过提示通路处理视觉信息,从而更有效地对齐问题意图与视觉证据,是一种提升医学视觉问答可信度的实用策略。
Abstract: Medical visual question answering (Med-VQA) aims to answer clinically relevant questions grounded in medical images. However, existing multimodal large language models (MLLMs) often exhibit shortcut answering, producing plausible responses by exploiting language priors or dataset biases while insufficiently attending to visual evidence. This behavior undermines clinical reliability, especially when subtle imaging findings are decisive. We propose a lightweight plug-in framework, termed Intent-aware Visual Cues (InViC), to explicitly enhance image-based answer generation in medical VQA. InViC introduces a Cue Tokens Extraction (CTE) module that distills dense visual tokens into a compact set of K question-conditioned cue tokens, which serve as structured visual intermediaries injected into the LLM decoder to promote intent-aligned visual evidence. To discourage bypassing of visual information, we further design a two-stage fine-tuning strategy with a cue-bottleneck attention mask. In Stage I, we employ an attention mask to block the LLM’s direct view of raw visual features, thereby funneling all visual evidence through the cue pathway. In Stage II, standard causal attention is restored to train the LLM to jointly exploit the visual and cue tokens. We evaluate InViC on three public Med-VQA benchmarks (VQA-RAD, SLAKE, and ImageCLEF VQA-Med 2019) across multiple representative MLLMs. InViC consistently improves over zero-shot inference and standard LoRA fine-tuning, demonstrating that intent-aware visual cues with bottlenecked training is a practical and effective strategy for improving trustworthy Med-VQA.
[71] Semantic One-Dimensional Tokenizer for Image Reconstruction and Generation cs.CVPDF
Yunpeng Qu, Kaidong Zhang, Yukang Ding, Ying Chen, Jian Wang
TL;DR: 本文提出了一种名为SemTok的语义一维分词器,它将二维图像压缩为一维离散标记,并强调捕获高级语义信息。该方法通过创新的2D到1D标记化方案、语义对齐约束和两阶段生成训练策略,在图像重建任务上达到了新的SOTA水平,并基于此构建了掩码自回归生成框架,显著提升了下游图像生成任务的性能。
Details
Motivation: 现有视觉分词器主要将图像映射到固定的二维空间网格,并专注于像素级恢复,这阻碍了捕获具有紧凑全局语义的表示。为了解决这些问题,本文旨在开发一种能够生成紧凑且富含高级语义的一维标记的视觉分词器。
Result: SemTok在图像重建任务上达到了新的最先进水平(SOTA),以极其紧凑的标记表示实现了卓越的保真度。实验证实了其语义一维标记化的有效性,并且基于它构建的掩码自回归生成框架在下游图像生成任务中带来了显著改进。
Insight: 主要创新点包括:1)将图像从2D空间网格压缩为1D离散标记的标记化方案;2)引入语义对齐约束以提升标记的语义层次;3)采用两阶段生成训练策略。从客观角度看,将视觉标记从2D降维到1D以追求更紧凑的全局语义表示是一个值得关注的思路,可能为后续的生成模型和多模态对齐任务提供更高效的表示基础。
Abstract: Visual generative models based on latent space have achieved great success, underscoring the significance of visual tokenization. Mapping images to latents boosts efficiency and enables multimodal alignment for scaling up in downstream tasks. Existing visual tokenizers primarily map images into fixed 2D spatial grids and focus on pixel-level restoration, which hinders the capture of representations with compact global semantics. To address these issues, we propose \textbf{SemTok}, a semantic one-dimensional tokenizer that compresses 2D images into 1D discrete tokens with high-level semantics. SemTok sets a new state-of-the-art in image reconstruction, achieving superior fidelity with a remarkably compact token representation. This is achieved via a synergistic framework with three key innovations: a 2D-to-1D tokenization scheme, a semantic alignment constraint, and a two-stage generative training strategy. Building on SemTok, we construct a masked autoregressive generation framework, which yields notable improvements in downstream image generation tasks. Experiments confirm the effectiveness of our semantic 1D tokenization. Our code will be open-sourced.
[72] HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction cs.CVPDF
Jing Dai, Chen Wu, Ming Wu, Qibin Zhang, Zexi Wu
TL;DR: 该论文提出了HGP-Mamba,一个基于Mamba架构的多模态框架,用于整合组织病理学图像和生成的蛋白质特征以预测癌症生存风险。该方法通过预训练基础模型直接从全切片图像中提取高通量蛋白质嵌入,并设计了局部交互感知Mamba和全局交互增强Mamba模块来捕获细粒度和整体层面的跨模态依赖关系。
Details
Motivation: 解决蛋白质表达谱数据成本高、获取困难的问题,以充分利用蛋白质标记物和组织病理学图像的联合预后潜力,改进癌症生存风险预测。
Result: 在四个公开癌症数据集上的实验表明,HGP-Mamba取得了最先进的性能,同时与现有方法相比保持了卓越的计算效率。
Insight: 创新点在于利用预训练基础模型从WSI中生成蛋白质特征,避免了昂贵的蛋白质谱实验,并设计了专门的Mamba模块(LiAM和GiEM)进行高效的多模态特征交互与融合,实现了数据高效且计算高效的生存预测。
Abstract: Recent advances in multimodal learning have significantly improved cancer survival risk prediction. However, the joint prognostic potential of protein markers and histopathology images remains underexplored, largely due to the high cost and limited availability of protein expression profiling. To address this challenge, we propose HGP-Mamba, a Mamba-based multimodal framework that efficiently integrates histological with generated protein features for survival risk prediction. Specifically, we introduce a protein feature extractor (PFE) that leverages pretrained foundation models to derive high-throughput protein embeddings directly from Whole Slide Images (WSIs), enabling data-efficient incorporation of molecular information. Together with histology embeddings that capture morphological patterns, we further introduce the Local Interaction-aware Mamba (LiAM) for fine-grained feature interaction and the Global Interaction-enhanced Mamba (GiEM) to promote holistic modality fusion at the slide level, thus capture complex cross-modal dependencies. Experiments on four public cancer datasets demonstrate that HGP-Mamba achieves state-of-the-art performance while maintaining superior computational efficiency compared with existing methods. Our source code is publicly available at this https URL.
[73] SF-Mamba: Rethinking State Space Model for Vision cs.CV | cs.AIPDF
Masakazu Yoshimura, Teruaki Hayashi, Yuki Hoshino, Wei-Yao Wang, Takeshi Ohashi
TL;DR: 本文提出SF-Mamba,一种新颖的视觉Mamba模型,旨在解决现有视觉Mamba模型中因单向扫描机制导致的非因果交互限制和计算效率问题。通过引入辅助补丁交换和批量折叠与周期性状态重置两种关键技术,SF-Mamba在保持计算效率的同时,有效编码了双向信息流并提升了GPU并行性。在图像分类、目标检测、实例分割和语义分割等任务上的实验表明,SF-Mamba在性能上显著超越了现有最先进的基线模型,同时提高了不同模型尺寸下的吞吐量。
Details
Motivation: 现有视觉Mamba模型虽然通过循环扫描机制提供了计算效率,但其单向扫描本质上限制了图像块之间的非因果交互。先前工作试图通过多扫描策略解决此问题,但存在扫描设计不优和频繁数据重排导致的效率低下问题。此外,Mamba在视觉任务常用的短令牌长度下计算速度相对较慢。本文旨在重新思考视觉扫描操作和Mamba的计算效率,以构建真正高效的视觉编码器。
Result: 在图像分类、目标检测、实例分割和语义分割等多个视觉任务上进行了广泛实验。结果表明,所提出的SF-Mamba模型在不同模型尺寸下均显著超越了最先进的基线模型,同时提高了吞吐量。
Insight: 主要创新点包括:1)辅助补丁交换技术,在单向扫描下编码双向信息流,解决了非因果交互限制;2)批量折叠与周期性状态重置技术,提升了GPU并行性,优化了计算效率。从客观角度看,该工作对视觉Mamba的扫描机制和并行计算模式进行了系统性反思与重新设计,为构建高效视觉状态空间模型提供了新的思路。
Abstract: The realm of Mamba for vision has been advanced in recent years to strike for the alternatives of Vision Transformers (ViTs) that suffer from the quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba exhibits relatively slow computational speed under short token lengths, commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key proposals: auxiliary patch swapping for encoding bidirectional information flow under an unidirectional scan and batch folding with periodic state reset for advanced GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.
[74] Cross-modal learning for plankton recognition cs.CVPDF
Joona Kareinen, Veikka Immonen, Tuomas Eerola, Lumi Haraguchi, Lasse Lensu
TL;DR: 本文提出了一种基于自监督跨模态协调的浮游生物识别方法,利用图像和光学测量数据(如散射和荧光剖面)的多模态信息,通过对比学习训练编码器,结合少量标注数据和k-NN分类器实现高效识别。
Details
Motivation: 当前浮游生物识别主要依赖有监督方法,需要大量人工标注数据,而现代成像仪器采集的光学测量数据未被充分利用;本文旨在利用多模态数据和自监督学习减少对标注的依赖。
Result: 该方法在浮游生物识别任务中实现了高准确率,仅需少量标注图像,且优于仅使用图像的自监督基线方法。
Insight: 创新点包括将CLIP的对比学习思想应用于浮游生物图像与光学剖面数据的跨模态对齐,通过自监督学习利用未标注多模态数据,构建了无需大量标注的识别模型。
Abstract: This paper considers self-supervised cross-modal coordination as a strategy enabling utilization of multiple modalities and large volumes of unlabeled plankton data to build models for plankton recognition. Automated imaging instruments facilitate the continuous collection of plankton image data on a large scale. Current methods for automatic plankton image recognition rely primarily on supervised approaches, which require labeled training sets that are labor-intensive to collect. On the other hand, some modern plankton imaging instruments complement image information with optical measurement data, such as scatter and fluorescence profiles, which currently are not widely utilized in plankton recognition. In this work, we explore the possibility of using such measurement data to guide the learning process without requiring manual labeling. Inspired by the concepts behind Contrastive Language-Image Pre-training, we train encoders for both modalities using only binary supervisory information indicating whether a given image and profile originate from the same particle or from different particles. For plankton recognition, we employ a small labeled gallery of known plankton species combined with a $k$-NN classifier. This approach yields a recognition model that is inherently multimodal, i.e., capable of utilizing information extracted from both image and profile data. We demonstrate that the proposed method achieves high recognition accuracy while requiring only a minimal number of labeled images. Furthermore, we show that the approach outperforms an image-only self-supervised baseline. Code available at https://github.com/Jookare/cross-modal-plankton.
[75] IRIS: A Real-World Benchmark for Inverse Recovery and Identification of Physical Dynamic Systems from Monocular Video cs.CV | cs.LGPDF
Rasul Khanbayov, Mohamed Rayan Barhdadi, Erchin Serpedin, Hasan Kurban
TL;DR: 本文提出了IRIS基准数据集,用于评估从单目视频中无监督恢复和识别物理动态系统的性能。该数据集包含220个4K分辨率、60fps的真实世界视频,涵盖单体和多体动力学系统,并提供独立测量的真实参数、不确定性估计及控制方程。论文定义了标准化评估协议,并测试了多种基线方法,揭示了系统性失败模式以推动未来研究。
Details
Motivation: 现有从视频进行无监督物理参数估计的方法缺乏通用基准:现有方法在互不重叠的合成数据上评估,唯一的真实世界数据集仅限于单体系统,且没有既定协议处理控制方程识别问题。
Result: 在IRIS数据集上评估了多种基线方法,包括多步物理损失公式和四种互补的方程识别策略(VLM时序推理、描述-分类提示、基于CNN的分类和基于路径的标注),为所有IRIS场景建立了参考性能,并揭示了系统性失败模式。
Insight: 创新点在于引入了首个高保真、涵盖单体和多体动力学的真实世界基准数据集,并提供了标准化评估协议,促进了物理参数估计和控制方程识别任务的公平比较与深入研究。
Abstract: Unsupervised physical parameter estimation from video lacks a common benchmark: existing methods evaluate on non-overlapping synthetic data, the sole real-world dataset is restricted to single-body systems, and no established protocol addresses governing-equation identification. This work introduces IRIS, a high-fidelity benchmark comprising 220 real-world videos captured at 4K resolution and 60,fps, spanning both single- and multi-body dynamics with independently measured ground-truth parameters and uncertainty estimates. Each dynamical system is recorded under controlled laboratory conditions and paired with its governing equations, enabling principled evaluation. A standardized evaluation protocol is defined encompassing parameter accuracy, identifiability, extrapolation, robustness, and governing-equation selection. Multiple baselines are evaluated, including a multi-step physics loss formulation and four complementary equation-identification strategies (VLM temporal reasoning, describe-then-classify prompting, CNN-based classification, and path-based labelling), establishing reference performance across all IRIS scenarios and exposing systematic failure modes that motivate future research. The dataset, annotations, evaluation toolkit, and all baseline implementations are publicly released.
[76] ProgressiveAvatars: Progressive Animatable 3D Gaussian Avatars cs.CV | cs.GRPDF
Kaiwen Song, Jinkai Cui, Juyong Zhang
TL;DR: 本文提出了一种名为ProgressiveAvatars的渐进式3D高斯化身表示方法,用于实时XR和远程呈现应用。该方法基于模板网格的自适应隐式细分构建层次化的3D高斯表示,支持在表情和头部运动变化下保持可动画性,并能根据屏幕空间信号动态分配资源以增加细节。它支持增量加载和渲染,在不同网络带宽和计算资源下实现平滑的质量提升。
Details
Motivation: 解决在实时XR和远程呈现应用中,网络和计算资源频繁波动,需要一种能够适应资源变化的渐进式3D表示方法的问题。
Result: 论文提出的方法支持渐进式传输和渲染,能够在波动的网络带宽以及变化的计算和内存资源下工作,实现平滑的质量改进。
Insight: 创新点在于将3D高斯定义在面部局部坐标系中以保持跨多细节级别的可动画性,并利用基于屏幕空间信号的自适应隐式细分构建层次结构,结合重要性排序实现增量加载和渲染,从而适应动态资源环境。
Abstract: In practical real-time XR and telepresence applications, network and computing resources fluctuate frequently. Therefore, a progressive 3D representation is needed. To this end, we propose ProgressiveAvatars, a progressive avatar representation built on a hierarchy of 3D Gaussians grown by adaptive implicit subdivision on a template mesh. 3D Gaussians are defined in face-local coordinates to remain animatable under varying expressions and head motion across multiple detail levels. The hierarchy expands when screen-space signals indicate a lack of detail, allocating resources to important areas. Leveraging importance ranking, ProgressiveAvatars supports incremental loading and rendering, adding new Gaussians as they arrive while preserving previous content, thus achieving smooth quality improvements across varying bandwidths. ProgressiveAvatars enables progressive delivery and progressive rendering under fluctuating network bandwidth and varying compute and memory resources.
[77] TinyGLASS: Real-Time Self-Supervised In-Sensor Anomaly Detection cs.CVPDF
Pietro Bonazzi, Rafael Sutter, Luigi Capogrosso, Mischa Buob, Michele Magno
TL;DR: 本文提出了TinyGLASS,一个轻量化的自监督异常检测框架,专为索尼IMX500智能视觉传感器上的实时传感器内计算而设计。它通过将骨干网络替换为ResNet-18,并采用静态图追踪和INT8量化,在保持竞争力的检测性能的同时,实现了8.7倍的参数压缩,并在IMX500平台上达到20 FPS的实时处理速度。
Details
Motivation: 解决现有自监督异常检测方法(如GLASS)计算需求高,难以部署在资源受限的边缘平台(如智能视觉传感器)上的问题。
Result: 在MVTec-AD基准测试上达到94.2%的图像级AUROC,在IMX500平台内存限制(8 MB)内以20 FPS运行,功耗为4.0 mJ/次推理,能效为470 GMAC/J。在训练数据存在一定污染的情况下也保持了稳定的性能。
Insight: 通过骨干网络轻量化(ResNet-18)和部署导向的优化(静态图、INT8量化),在严格资源约束下实现了高性能的实时异常检测,为边缘AI部署提供了实用的系统级优化范例。
Abstract: Anomaly detection plays a key role in industrial quality control, where defects must be identified despite the scarcity of labeled faulty samples. Recent self-supervised approaches, such as GLASS, learn normal visual patterns using only defect-free data and have shown strong performance on industrial benchmarks. However, their computational requirements limit deployment on resource-constrained edge platforms. This work introduces TinyGLASS, a lightweight adaptation of the GLASS framework designed for real-time in-sensor anomaly detection on the Sony IMX500 intelligent vision sensor. The proposed architecture replaces the original WideResNet-50 backbone with a compact ResNet-18 and introduces deployment-oriented modifications that enable static graph tracing and INT8 quantization using Sony’s Model Compression Toolkit. In addition to evaluating performance on the MVTec-AD benchmark, we investigate robustness to contaminated training data and introduce a custom industrial dataset, named MMS Dataset, for cross-device evaluation. Experimental results show that TinyGLASS achieves 8.7x parameter compression while maintaining competitive detection performance, reaching 94.2% image-level AUROC on MVTec-AD and operating at 20 FPS within the 8 MB memory constraints of the IMX500 platform. System profiling demonstrates low power consumption (4.0 mJ per inference), real-time end-to-end latency (20 FPS), and high energy efficiency (470 GMAC/J). Furthermore, the model maintains stable performance under moderate levels of training data contamination.
[78] Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval cs.CVPDF
Weiqing Li, Jinyue Guo, Yaqi Wang, Haiyang Xiao, Yuewei Zhang
TL;DR: 本文提出了Evo-Retriever,一个用于多模态文档检索的检索框架。该框架通过LLM引导的课程进化机制,结合新颖的视点-路径协作,解决了视觉-语言模型在处理异构、非结构化文档时跨模态嵌入不一致的问题。其核心包括多视图图像对齐、双向对比学习生成“困难查询”以及利用LLM元控制器自适应调整训练课程。
Details
Motivation: 现实世界中文档的异构性和非结构化特性破坏了视觉-语言模型跨模态嵌入的一致性,而现有基于有限样本和静态策略的后期交互方法无法适应模型的动态演化,导致跨模态检索混淆。
Result: 在ViDoRe V2和MMEB (VisDoc)基准测试上,Evo-Retriever取得了最先进的性能,nDCG@5分数分别达到65.2%和77.1%。
Insight: 创新点在于提出了一个LLM引导的课程进化框架,通过视点-路径协作整合了多视图对齐、双向对比学习生成困难样本以及基于LLM的自适应课程调度,从而动态优化模型训练过程,提升跨模态检索的鲁棒性和准确性。
Abstract: Visual-language models (VLMs) excel at data mappings, but real-world document heterogeneity and unstructuredness disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model’s dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates “hard queries” and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model’s evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%.
[79] GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models cs.CVPDF
Jiaxin Zhang, Junjun Jiang, Haijie Li, Youyu Chen, Kui Jiang
TL;DR: 本文提出了GAP-MLLM,一种几何对齐的预训练范式,旨在激活多模态大语言模型(MLLMs)的3D空间感知能力。该方法通过视觉提示的联合任务,迫使模型在预测语义标签的同时预测稀疏点云图,并设计了带门控机制的多级渐进融合模块,以自适应地整合几何先验。实验表明,该方法显著提升了在3D视觉定位、3D密集描述和3D视频目标检测等任务上的性能。
Details
Motivation: 现有基于纯RGB输入的MLLMs在3D空间感知方面存在不足,作者认为性能差距源于以文本为主导的微调范式未能激活模型内部的几何表示,而非几何先验不足。现有方法通常采用简单的特征拼接,缺乏针对几何的监督,导致结构信息利用不佳。
Result: 广泛的实验表明,GAP-MLLM显著增强了几何特征融合能力,并在3D视觉定位、3D密集描述和3D视频目标检测等多个3D感知任务上持续提升了性能。
Insight: 核心创新点在于提出了一个几何对齐的预训练阶段,通过联合预测语义和稀疏点云来显式激活几何感知,并设计了自适应融合几何与语义信息的门控融合模块,避免了语义推理被抑制,为增强纯视觉MLLMs的3D空间理解提供了新思路。
Abstract: Multimodal Large Language Models (MLLMs) demonstrate exceptional semantic reasoning but struggle with 3D spatial perception when restricted to pure RGB inputs. Despite leveraging implicit geometric priors from 3D reconstruction models, image-based methods still exhibit a notable performance gap compared to methods using explicit 3D data. We argue that this gap does not arise from insufficient geometric priors, but from a misalignment in the training paradigm: text-dominated fine-tuning fails to activate geometric representations within MLLMs. Existing approaches typically resort to naive feature concatenation and optimize directly for downstream tasks without geometry-specific supervision, leading to suboptimal structural utilization. To address this limitation, we propose GAP-MLLM, a Geometry-Aligned Pre-training paradigm that explicitly activates structural perception before downstream adaptation. Specifically, we introduce a visual-prompted joint task that compels the MLLMs to predict sparse pointmaps alongside semantic labels, thereby enforcing geometric awareness. Furthermore, we design a multi-level progressive fusion module with a token-level gating mechanism, enabling adaptive integration of geometric priors without suppressing semantic reasoning. Extensive experiments demonstrate that GAP-MLLM significantly enhances geometric feature fusion and consistently enhances performance across 3D visual grounding, 3D dense captioning, and 3D video object detection tasks.
[80] Retrieving Counterfactuals Improves Visual In-Context Learning cs.CV | cs.AI | cs.CLPDF
Guangzhi Xiong, Sanchit Sinha, Zhenghao He, Aidong Zhang
TL;DR: 本文提出了一种名为CIRCLES的新框架,通过主动检索反事实风格的示例来构建视觉上下文学习(ICL)的演示集,以提升视觉语言模型(VLMs)在因果推理和细粒度属性理解上的鲁棒性。
Details
Motivation: 现有基于相似性检索的上下文学习方法倾向于选择相关但非因果的示例,放大了虚假关联并限制了模型的鲁棒性,因此需要一种能促进因果推理的示例选择方法。
Result: 在四个不同数据集上的综合实验表明,CIRCLES在多种模型架构上持续优于现有方法,尤其是在小规模模型和信息稀缺场景下提升显著。
Insight: 核心创新在于通过属性引导的组合图像检索主动构建反事实风格的演示集,使模型能够隐式地推理属性与结果之间的因果关系,超越了表面的相关性,从而提升了推理的鲁棒性和可解释性。
Abstract: Vision-language models (VLMs) have achieved impressive performance across a wide range of multimodal reasoning tasks, but they often struggle to disentangle fine-grained visual attributes and reason about underlying causal relationships. In-context learning (ICL) offers a promising avenue for VLMs to adapt to new tasks, but its effectiveness critically depends on the selection of demonstration examples. Existing retrieval-augmented approaches typically rely on passive similarity-based retrieval, which tends to select correlated but non-causal examples, amplifying spurious associations and limiting model robustness. We introduce CIRCLES (Composed Image Retrieval for Causal Learning Example Selection), a novel framework that actively constructs demonstration sets by retrieving counterfactual-style examples through targeted, attribute-guided composed image retrieval. By incorporating counterfactual-style examples, CIRCLES enables VLMs to implicitly reason about the causal relations between attributes and outcomes, moving beyond superficial correlations and fostering more robust and grounded reasoning. Comprehensive experiments on four diverse datasets demonstrate that CIRCLES consistently outperforms existing methods across multiple architectures, especially on small-scale models, with pronounced gains under information scarcity. Furthermore, CIRCLES retrieves more diverse and causally informative examples, providing qualitative insights into how models leverage in-context demonstrations for improved reasoning. Our code is available at https://github.com/gzxiong/CIRCLES.
[81] VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations cs.CVPDF
Fucai Ke, Zhixi Cai, Boying Li, Long Chen, Beibei Lin
TL;DR: 论文提出了VIEW2SPACE,一个用于稀疏多视角视觉推理的多维基准测试,通过物理模拟构建了多样化的高保真3D场景来生成可扩展的数据集,并评估了现有视觉语言和空间模型,发现多视角推理任务仍远未解决。
Details
Motivation: 解决现实世界中智能系统从稀疏离散视角理解复杂环境的多视角视觉推理问题,现有研究多集中于单图像或时间密集视频,缺乏大规模、带精确几何和语义标注的多视角数据。
Result: 在VIEW2SPACE基准上,现有最先进的视觉语言和空间模型表现仅略高于随机猜测;提出的Grounded Chain-of-Thought with Visual Evidence方法在中等难度下显著提升了性能,并在跨数据集评估中优于现有方法,泛化到了真实世界数据。
Insight: 利用物理模拟生成可扩展、可迁移到真实场景的多视角数据是解决数据稀缺的关键创新;提出的基准和基于视觉证据的思维链方法为多视角推理提供了新的评估框架和性能提升路径;分析表明几何感知在足够可见度下可通过缩放受益,但跨稀疏视图的深度组合推理仍是根本挑战。
Abstract: Multi-view visual reasoning is essential for intelligent systems that must understand complex environments from sparse and discrete viewpoints, yet existing research has largely focused on single-image or temporally dense video settings. In real-world scenarios, reasoning across views requires integrating partial observations without explicit guidance, while collecting large-scale multi-view data with accurate geometric and semantic annotations remains challenging. To address this gap, we leverage physically grounded simulation to construct diverse, high-fidelity 3D scenes with precise per-view metadata, enabling scalable data generation that remains transferable to real-world settings. Based on this engine, we introduce VIEW2SPACE, a multi-dimensional benchmark for sparse multi-view reasoning, together with a scalable, disjoint training split supporting millions of grounded question-answer pairs. Using this benchmark, a comprehensive evaluation of state-of-the-art vision-language and spatial models reveals that multi-view reasoning remains largely unsolved, with most models performing only marginally above random guessing. We further investigate whether training can bridge this gap. Our proposed Grounded Chain-of-Thought with Visual Evidence substantially improves performance under moderate difficulty, and generalizes to real-world data, outperforming existing approaches in cross-dataset evaluation. We further conduct difficulty-aware scaling analyses across model size, data scale, reasoning depth, and visibility constraints, indicating that while geometric perception can benefit from scaling under sufficient visibility, deep compositional reasoning across sparse views remains a fundamental challenge.
[82] Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty cs.CVPDF
Mangyu Kong, Jaewon Lee, Seongwon Lee, Euntai Kim
TL;DR: 本文重新审视了基于3D高斯泼溅(3DGS)的相机姿态优化问题,指出其鲁棒性严重依赖于初始姿态和重建几何的准确性。针对姿态先验不确定性和几何不确定性这两个主要误差来源,作者提出了一种结合蒙特卡洛姿态采样与基于费舍尔信息的PnP优化的重定位框架,无需重新训练或额外监督。该方法在多种室内外基准测试中,显著提升了定位精度和在姿态与深度噪声下的稳定性。
Details
Motivation: 尽管3DGS能提供高质量的可微分渲染,但基于它的姿态优化对初始相机姿态和重建的几何结构非常敏感,鲁棒性不足。本文旨在解决由姿态先验(如回归/检索模型输出单一确定性估计)和3DGS重建不完美所导致的几何不确定性,这些不确定性会扭曲重投影几何并破坏优化稳定性。
Result: 在多种室内和室外基准测试中,该方法一致地提高了定位精度,并显著增强了在姿态和深度噪声下的稳定性。
Insight: 核心创新在于明确建模并联合处理姿态先验不确定性与几何不确定性,通过蒙特卡洛采样探索姿态空间,并利用费舍尔信息指导PnP优化,从而在不增加额外训练或监督的情况下,提升了3DGS姿态优化的鲁棒性和准确性。
Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful scene representation and is increasingly used for visual localization and pose refinement. However, despite its high-quality differentiable rendering, the robustness of 3DGS-based pose refinement remains highly sensitive to both the initial camera pose and the reconstructed geometry. In this work, we take a closer look at these limitations and identify two major sources of uncertainty: (i) pose prior uncertainty, which often arises from regression or retrieval models that output a single deterministic estimate, and (ii) geometric uncertainty, caused by imperfections in the 3DGS reconstruction that propagate errors into PnP solvers. Such uncertainties can distort reprojection geometry and destabilize optimization, even when the rendered appearance still looks plausible. To address these uncertainties, we introduce a relocalization framework that combines Monte Carlo pose sampling with Fisher Information-based PnP optimization. Our method explicitly accounts for both pose and geometric uncertainty and requires no retraining or additional supervision. Across diverse indoor and outdoor benchmarks, our approach consistently improves localization accuracy and significantly increases stability under pose and depth noise.
[83] Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models cs.CV | cs.MMPDF
Jiale Song, Jiaxin Luo, Xue-song Tang, Kuangrong Hao, Mingbo Zhao
TL;DR: 本文提出了一种基于分割的注意力熵(SAE)方法,用于检测和缓解大型视觉语言模型(LVLM)中的物体幻觉问题。该方法利用语义分割在物体级语义空间中量化视觉注意力的不确定性,并基于SAE设计了幻觉检测的可靠性评分和推理时的注意力调整机制,以在不增加训练成本的情况下减少幻觉。
Details
Motivation: 现有研究多从文本模态出发,将物体幻觉归因于过强的语言先验和视觉基础不足,而本文观察到视觉模态内部的异常注意力模式也会导致幻觉物体,因此提出从视觉注意力不确定性角度进行检测和缓解。
Result: 在公开基准测试和四足机器人的真实具身多模态场景中,实验结果表明SAE显著减少了物体幻觉,且无需额外训练成本,从而实现了更可信的LVLM驱动感知与决策。
Insight: 创新点在于首次从视觉模态内部注意力模式异常的角度分析物体幻觉,并利用语义分割构建对象级语义空间来量化注意力不确定性;可借鉴之处包括基于分割的注意力熵计算、推理时动态调整注意力的轻量化缓解策略,以及适用于具身智能场景的评估验证。
Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy (SAE), which leverages semantic segmentation to quantify visual attention uncertainty in an object-level semantic space. Based on SAE, we further design a reliability score for hallucination detection and an SAE-guided attention adjustment method that modifies visual attention at inference time to mitigate hallucinations. We evaluate our approach on public benchmarks and in real embodied multimodal scenarios with quadruped robots. Experimental results show that SAE substantially reduces object hallucinations without any additional training cost, thereby enabling more trustworthy LVLM-driven perception and decision-making.
[84] Understanding Cell Fate Decisions with Temporal Attention cs.CV | q-bio.CB | q-bio.QMPDF
Florian Bürger, Martim Dias Gomes, Adrián E. Granada, Noémie Moreau, Katarzyna Bozek
TL;DR: 本文提出了一种基于Transformer的深度学习模型,用于从化疗处理的癌细胞长期活细胞成像原始序列中直接预测细胞命运(如凋亡或存活),无需依赖预定义的形态或分子特征。模型不仅实现了高精度的分类,还引入了一个全面的可解释性框架,以揭示引导预测的时间与形态线索。
Details
Motivation: 解决在相同治疗条件下,基因相同的细胞为何表现出不同命运(如凋亡或存活)的问题,这对于理解和改进癌症疗法至关重要。传统方法依赖预定义特征,而本文旨在直接从原始图像序列中学习并解释细胞命运决策的非遗传决定因素。
Result: 模型在仅使用视频数据的情况下,达到了0.94的平衡准确率和0.93的F1分数。注意力与掩蔽实验表明,预测细胞命运的信号并非仅存在于细胞轨迹的最后几帧,可靠预测可在事件发生前长达10小时实现。
Insight: 创新点在于将基于注意力的时序模型(Transformer)应用于原始细胞成像序列的直接命运预测,并构建了可解释性框架来揭示预测所依据的时间分布(如有丝分裂和凋亡序列中预测信息的独特时间分布)和形态学线索(如细胞形态和p53信号的作用)。这为理解细胞决策的非遗传机制提供了新工具和生物学见解。
Abstract: Understanding non-genetic determinants of cell fate is critical for developing and improving cancer therapies, as genetically identical cells can exhibit divergent outcomes under the same treatment conditions. In this work, we present a deep learning approach for cell fate prediction from raw long-term live-cell recordings of cancer cell populations under chemotherapeutic treatment. Our Transformer model is trained to predict cell fate directly from raw image sequences, without relying on predefined morphological or molecular features. Beyond classification, we introduce a comprehensive explainability framework for interpreting the temporal and morphological cues guiding the model’s predictions. We demonstrate that prediction of cell outcomes is possible based on the video only, our model achieves balanced accuracy of 0.94 and an F1-score of 0.93. Attention and masking experiments further indicate that the signal predictive of the cell fate is not uniquely located in the final frames of a cell trajectory, as reliable predictions are possible up to 10 h before the event. Our analysis reveals distinct temporal distribution of predictive information in the mitotic and apoptotic sequences, as well as the role of cell morphology and p53 signaling in determining cell outcomes. Together, these findings demonstrate that attention-based temporal models enable accurate cell fate prediction while providing biologically interpretable insights into non-genetic determinants of cellular decision-making. The code is available at https://github.com/bozeklab/Cell-Fate-Prediction.
[85] VideoMatGen: PBR Materials through Joint Generative Modeling cs.CV | cs.GRPDF
Jon Hasselgren, Zheng Zeng, Milos Hasan, Jacob Munkberg
TL;DR: 本文提出VideoMatGen方法,基于视频扩散Transformer架构,通过联合建模多种材质属性(基础色、粗糙度、金属度、高度图)为3D形状生成物理准确的材质,并引入定制化变分自编码器将多模态材质编码到紧凑潜在空间,实现高效联合生成。
Details
Motivation: 解决为3D形状生成高质量、物理准确的材质的问题,现有方法在联合生成多种材质属性时面临计算复杂度和模态协调的挑战。
Result: 方法能根据文本提示为3D形状生成高质量材质,兼容常见内容创作工具,但摘要未提及具体基准测试或定量比较结果。
Insight: 创新点包括使用视频扩散Transformer进行材质生成,以及定制化变分自编码器实现多模态材质的紧凑联合表示,避免了令牌数量增加,提升了生成效率。
Abstract: We present a method for generating physically-based materials for 3D shapes based on a video diffusion transformer architecture. Our method is conditioned on input geometry and a text description, and jointly models multiple material properties (base color, roughness, metallicity, height map) to form physically plausible materials. We further introduce a custom variational auto-encoder which encodes multiple material modalities into a compact latent space, which enables joint generation of multiple modalities without increasing the number of tokens. Our pipeline generates high-quality materials for 3D shapes given a text prompt, compatible with common content creation tools.
[86] REFORGE: Multi-modal Attacks Reveal Vulnerable Concept Unlearning in Image Generation Models cs.CV | cs.AI | cs.CR | cs.LGPDF
Yong Zou, Haoran Li, Fanxiao Li, Shenyang Wei, Yunyun Dong
TL;DR: 本文提出了REFORGE,一个用于评估图像生成模型遗忘(IGMU)鲁棒性的黑盒红队框架。该框架通过生成对抗性图像提示,利用基于笔划的初始化和跨注意力引导的掩蔽策略来优化扰动,在保持视觉保真度的同时有效攻击已遗忘有害概念的模型。实验表明,REFORGE在多个代表性遗忘任务和防御方法上显著提高了攻击成功率,揭示了当前IGMU方法在面对多模态对抗攻击时存在持续脆弱性。
Details
Motivation: 图像生成模型(IGMs)在生成高保真内容的同时,也带来了复制受版权保护内容和生成冒犯性内容的风险。图像生成模型遗忘(IGMU)旨在无需完全重新训练即可移除有害概念,但其在对抗性输入(尤其是黑盒设置下的图像侧威胁)下的鲁棒性尚未得到充分探索。
Result: 在多个代表性遗忘任务和防御方法上进行的大量实验表明,REFORGE显著提高了攻击成功率,同时比涉及的基线方法实现了更强的语义对齐和更高的效率。
Insight: 论文的主要创新点在于提出了一个专门针对IGMU的黑盒红队评估框架REFORGE,其核心是跨注意力引导的掩蔽策略,该策略将噪声分配到与概念相关的区域,从而在攻击有效性和视觉保真度之间取得平衡。从客观角度看,这项工作将对抗攻击的研究视角扩展到了模型遗忘这一新兴领域,并强调了针对多模态对抗攻击开发具有鲁棒性意识的遗忘方法的必要性。
Abstract: Recent progress in image generation models (IGMs) enables high-fidelity content creation but also amplifies risks, including the reproduction of copyrighted content and the generation of offensive content. Image Generation Model Unlearning (IGMU) mitigates these risks by removing harmful concepts without full retraining. Despite growing attention, the robustness under adversarial inputs, particularly image-side threats in black-box settings, remains underexplored. To bridge this gap, we present REFORGE, a black-box red-teaming framework that evaluates IGMU robustness via adversarial image prompts. REFORGE initializes stroke-based images and optimizes perturbations with a cross-attention-guided masking strategy that allocates noise to concept-relevant regions, balancing attack efficacy and visual fidelity. Extensive experiments across representative unlearning tasks and defenses demonstrate that REFORGE significantly improves attack success rate while achieving stronger semantic alignment and higher efficiency than involved baselines. These results expose persistent vulnerabilities in current IGMU methods and highlight the need for robustness-aware unlearning against multi-modal adversarial attacks. Our code is at: https://github.com/Imfatnoily/REFORGE.
[87] On the Transfer of Collinearity to Computer Vision cs.CVPDF
Frederik Beuth, Danny Kowerko
TL;DR: 该论文旨在将人类视觉感知中的共线性原则引入计算机视觉领域,探索其在多个应用场景中的潜在用途。通过开发原型模型并在四个用例中进行系统测试和基准评估,研究发现共线性在晶圆缺陷检测、纳米材料缺陷识别和遮挡处理等工业应用中能显著提升性能,但在ImageNet等自然图像数据集上效果有限。
Details
Motivation: 人类大脑中的共线性视觉感知现象能增强沿直线排列的空间对齐边缘,但其在现实世界中的目的和在计算机视觉及工程应用中的利用尚不明确。本文的动机是将这一原则迁移到计算机视觉中,探索其新颖应用潜力。
Result: 在晶圆缺陷检测中,共线性将错误率从6.5%降至5.26%,性能提升1.24倍;在纳米材料缺陷识别中,结合深度学习将错误率从21.65%降至6.64%,性能提升3.2倍。共线性在遮挡处理中也有益,但在ImageNet上效果不佳。
Insight: 论文的创新点在于首次将共线性原则系统性地迁移到计算机视觉中,并验证了其在工业应用(尤其是涉及人造线性结构的图像)中的有效性。这为计算机视觉提供了新的工具,可能有助于模拟人类视觉处理能力,特别是在特定结构化场景中。
Abstract: Collinearity is a visual perception phenomenon in the human brain that amplifies spatially aligned edges arranged along a straight line. However, it is vague for which purpose humans might have this principle in the real-world, and its utilization in computer vision and engineering applications even is a largely unexplored field. In this work, our goal is to transfer the collinearity principle to computer vision, and we explore the potential usages of this novel principle for computer vision applications. We developed a prototype model to exemplify the principle, then tested it systematically, and benchmarked it in the context of four use cases. Our cases are selected to spawn a broad range of potential applications and scenarios: sketching the combination of collinearity with deep learning (case I and II), using collinearity with saliency models (case II), and as a feature detector (case I). In the first use case, we found that collinearity is able to improve the fault detection of wafers and obtain a performance increase by a factor 1.24 via collinearity (decrease of the error rate from 6.5% to 5.26%). In the second use case, we test the defect recognition in nanotechnology materials and achieve a performance increase by 3.2x via collinearity (deep learning, error from 21.65% to 6.64%), and also explore saliency models. As third experiment, we cover occlusions; while as fourth experiment, we test ImageNet and observe that it might not be very beneficial for ImageNet. Therefore, we can assemble a list of scenarios for which collinearity is beneficial (wafers, nanotechnology, occlusions), and for what is not beneficial (ImageNet). Hence, we infer collinearity might be suitable for industry applications as it helps if the image structures of interest are man-made because they often consist of lines. Our work provides another tool for CV, hope to capture the power of human processing.
[88] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLMReward Models cs.CVPDF
Weijie Qiu, Dai Guan, Junxin Wang, Zhihang Li, Yongbo Gai
TL;DR: 本文提出Proxy-GRM方法,通过引入代理引导的评分标准验证机制来优化视觉语言模型生成式奖励模型中的评分标准生成阶段,从而提升评分标准的质量和可迁移性。
Details
Motivation: 现有生成式奖励模型通常采用三阶段流程(评分标准生成、基于准则打分、最终裁决),但中间评分标准很少被直接优化,且现有方法要么将其视为附带产物,要么依赖昂贵且不可微的LLM-as-judge检查,缺乏训练指导。
Result: 在约5万数据样本上,Proxy-GRM在VL-Reward Bench、Multimodal Reward Bench和MM-RLHF-Reward Bench上达到SOTA水平,优于使用四倍数据训练的方法。消融实验表明Proxy-SFT是比Proxy-RL更强的验证器,且隐式奖励聚合效果最佳。
Insight: 核心创新在于将评分标准质量作为可优化的目标,通过训练轻量级代理模型(Proxy-SFT/Proxy-RL)来预测偏好排序并以其准确率作为奖励信号,从而激励模型生成内部一致且可迁移的评分标准,这解决了中间表示缺乏直接监督的问题。
Abstract: Generative reward models (GRMs) for vision-language models (VLMs) often evaluate outputs via a three-stage pipeline: rubric generation, criterion-based scoring, and a final verdict. However, the intermediate rubric is rarely optimized directly. Prior work typically either treats rubrics as incidental or relies on expensive LLM-as-judge checks that provide no differentiable signal and limited training-time guidance. We propose Proxy-GRM, which introduces proxy-guided rubric verification into Reinforcement Learning (RL) to explicitly enhance rubric quality. Concretely, we train lightweight proxy agents (Proxy-SFT and Proxy-RL) that take a candidate rubric together with the original query and preference pair, and then predict the preference ordering using only the rubric as evidence. The proxy’s prediction accuracy serves as a rubric-quality reward, incentivizing the model to produce rubrics that are internally consistent and transferable. With ~50k data samples, Proxy-GRM reaches state-of-the-art results on the VL-Reward Bench, Multimodal Reward Bench, and MM-RLHF-Reward Bench, outperforming the methods trained on four times the data. Ablations show Proxy-SFT is a stronger verifier than Proxy-RL, and implicit reward aggregation performs best. Crucially, the learned rubrics transfer to unseen evaluators, improving reward accuracy at test time without additional training. Our code is available at https://github.com/Qwen-Applications/Proxy-GRM.
[89] MLLM-based Textual Explanations for Face Comparison cs.CV | cs.AIPDF
Redwan Sony, Anil K Jain, Ross Arun
TL;DR: 本文系统分析了多模态大语言模型(MLLMs)在无约束人脸验证任务中生成文本解释的可靠性,发现即使模型做出正确决策,其解释也常依赖不可验证或幻觉的面部属性,且结合传统人脸识别系统信息未能持续提升解释忠实度,并提出了基于似然比的评估框架。
Details
Motivation: 解决MLLMs在无约束人脸图像上生成解释的可靠性不足问题,特别是在极端姿态和监控图像场景下,现有解释常缺乏视觉证据支持。
Result: 在IJB-S数据集上的实验表明,MLLMs的解释常包含非可验证属性;结合传统系统分数虽提升分类性能,但未改善解释忠实度;新提出的似然比框架用于评估解释的证据强度。
Insight: 创新点在于系统评估MLLMs人脸解释的可靠性并揭示其局限性,提出了基于似然比的解释评估方法,强调生物识别应用中需原则性评估可信解释。
Abstract: Multimodal Large Language Models (MLLMs) have recently been proposed as a means to generate natural-language explanations for face recognition decisions. While such explanations facilitate human interpretability, their reliability on unconstrained face images remains underexplored. In this work, we systematically analyze MLLM-generated explanations for the unconstrained face verification task on the challenging IJB-S dataset, with a particular focus on extreme pose variation and surveillance imagery. Our results show that even when MLLMs produce correct verification decisions, the accompanying explanations frequently rely on non-verifiable or hallucinated facial attributes that are not supported by visual evidence. We further study the effect of incorporating information from traditional face recognition systems, viz., scores and decisions, alongside the input images. Although such information improves categorical verification performance, it does not consistently lead to faithful explanations. To evaluate the explanations beyond decision accuracy, we introduce a likelihood-ratio-based framework that measures the evidential strength of textual explanations. Our findings highlight fundamental limitations of current MLLMs for explainable face recognition and underscore the need for a principled evaluation of reliable and trustworthy explanations in biometric applications. Code is available at https://github.com/redwankarimsony/LR-MLLMFR-Explainability.
[90] FlowComposer: Composable Flows for Compositional Zero-Shot Learning cs.CVPDF
Zhenqi He, Lin Li, Long Chen
TL;DR: 本文提出FlowComposer,一个用于组合零样本学习(CZSL)的模型无关框架。它通过流匹配学习两个原始流,将视觉特征分别传输到属性和对象的文本嵌入,并引入一个可学习的组合器,在嵌入空间中显式融合它们的速度场以生成组合流。此外,设计了一种泄漏引导的数据增强方案,以利用残留的特征纠缠。该框架作为即插即用模块集成到多个基线模型中,在三个公开CZSL基准上取得了显著提升。
Details
Motivation: 解决当前基于视觉语言模型(VLM)和参数高效微调(PEFT)的CZSL方法存在的两个根本限制:1)隐式的组合构建(仅通过标记拼接或分支提示调优实现,缺乏嵌入空间中的显式操作);2)残留的特征纠缠(不完美的解纠缠导致属性、对象和组合特征相互污染),这些限制了模型的泛化能力。
Result: 在三个公开的CZSL基准(如MIT-States、UT-Zappos、C-GQA)上进行了全面评估。将FlowComposer作为即插即用组件集成到各种基线模型中,均实现了显著的性能提升,达到了当前最先进(SOTA)水平。
Insight: 主要创新点包括:1)首次系统性地将流匹配应用于CZSL,通过显式的流操作(原始流和组合流)在嵌入空间中构建组合,解决了隐式组合问题;2)设计了一个可学习的组合器,显式融合属性和对象的速度场;3)提出泄漏引导的数据增强方案,创新性地将残留的特征纠缠作为辅助信号加以利用,而非单纯视为噪声。从客观角度看,该框架的模型无关性和即插即用特性具有很好的通用性和可扩展性。
Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by recombining primitives learned from seen pairs. Recent CZSL methods built on vision-language models (VLMs) typically adopt parameter-efficient fine-tuning (PEFT). They apply visual disentanglers for decomposition and manipulate token-level prompts or prefixes to encode compositions. However, such PEFT-based designs suffer from two fundamental limitations: (1) Implicit Composition Construction, where composition is realized only via token concatenation or branch-wise prompt tuning rather than an explicit operation in the embedding space; (2) Remained Feature Entanglement, where imperfect disentanglement leaves attribute, object, and composition features mutually contaminated. Together, these issues limit the generalization ability of current CZSL models. In this paper, we are the first to systematically study flow matching for CZSL and introduce FlowComposer, a model-agnostic framework that learns two primitive flows to transport visual features toward attribute and object text embeddings, and a learnable Composer that explicitly fuses their velocity fields into a composition flow. To exploit the inevitable residual entanglement, we further devise a leakage-guided augmentation scheme that reuses leaked features as auxiliary signals. We thoroughly evaluate FlowComposer on three public CZSL benchmarks by integrating it as a plug-and-play component into various baselines, consistently achieving significant improvements.
[91] BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection cs.CVPDF
Melissa Schween, Mathis Kruse, Bodo Rosenhahn
TL;DR: 本文提出BUSSARD模型,一种基于标准化流的双向变换模型,用于检测从图像生成的场景图中的异常关系。该方法采用多模态策略,利用语言模型嵌入场景图中的物体和关系标记以利用现实世界的语义知识,并通过标准化流学习将物体-关系-物体三元组映射到简单基分布(如高斯分布),从而通过似然估计进行异常检测。
Details
Motivation: 解决场景图中异常关系检测的问题,旨在利用语义知识和概率模型来识别图像场景中不合理的物体间关系。
Result: 在包含办公室和餐厅场景的SARD数据集上评估,相比当前最先进模型,AUROC结果提升约10%,同时速度快五倍;消融实验显示模型具有优越的鲁棒性和通用性,特别是在使用同义词时基线性能偏差达17.5%,而本模型保持稳定。
Insight: 创新点在于结合语言模型的多模态嵌入与标准化流的双向变换,实现高效的场景图异常检测;客观分析认为该方法通过概率映射和语义利用,在检测精度和速度上均取得显著提升,展示了基于学习的方法在关系异常检测中的潜力。
Abstract: We propose Bijective Universal Scene-Specific Anomalous Relationship Detection (BUSSARD), a normalizing flow-based model for detecting anomalous relations in scene graphs, generated from images. Our work follows a multimodal approach, embedding object and relationship tokens from scene graphs with a language model to leverage semantic knowledge from the real world. A normalizing flow model is used to learn bijective transformations that map object-relation-object triplets from scene graphs to a simple base distribution (typically Gaussian), allowing anomaly detection through likelihood estimation. We evaluate our approach on the SARD dataset containing office and dining room scenes. Our method achieves around 10% better AUROC results compared to the current state-of-the-art model, while simultaneously being five times faster. Through ablation studies, we demonstrate superior robustness and universality, particularly regarding the use of synonyms, with our model maintaining stable performance while the baseline shows 17.5% deviation. This work demonstrates the strong potential of learning-based methods for relationship anomaly detection in scene graphs. Our code is available at https://github.com/mschween/BUSSARD .
[92] HeBA: Heterogeneous Bottleneck Adapters for Robust Vision-Language Models cs.CVPDF
Md Jahidul Islam
TL;DR: 本文提出了一种名为HeBA(异构瓶颈适配器)的统一架构框架,用于增强大规模视觉语言模型(如CLIP)在下游任务中的适应能力。该方法通过引入模态特定的结构归纳偏置,解决了传统适配器架构在处理视觉和文本模态时采用“一刀切”方式的问题。HeBA采用三种关键架构创新:异构处理、瓶颈正则化和主动梯度初始化,从而在11个少样本基准测试中实现了最先进的性能。
Details
Motivation: 当前适应大规模视觉语言模型(如CLIP)到下游任务时,通常采用“一刀切”的架构方法,即使用宽泛、通用的适配器统一处理视觉和文本标记。这种同质性忽略了模态之间的结构差异:图像的空间局部性与文本的语义密度。因此,本文旨在通过引入模态特定的结构归纳偏置来解决这一问题,以提升模型的鲁棒性和准确性。
Result: 广泛的实验表明,HeBA的架构专业化设计在11个少样本基准测试中实现了最先进的性能,表现出卓越的稳定性和准确性。
Insight: 论文的创新点包括:1)异构处理:通过2D深度可分离卷积处理视觉标记以保留空间相关性,同时通过密集线性投影处理文本标记以捕捉语义关系;2)瓶颈正则化:采用压缩瓶颈(D -> D/4)强制模型学习紧凑、鲁棒的特征,并作为结构正则化器;3)主动梯度初始化:挑战限制性的零初始化范式,使用Kaiming初始化策略确保足够的初始梯度流以加速收敛,同时不损害冻结主干网络的预训练知识。从客观角度看,这些创新有效地结合了模态特异性与结构正则化,为视觉语言模型的适配提供了新的架构思路。
Abstract: Adapting large-scale Vision-Language Models (VLMs) like CLIP to downstream tasks often suffers from a “one-size-fits-all” architectural approach, where visual and textual tokens are processed uniformly by wide, generic adapters. We argue that this homogeneity ignores the distinct structural nature of the modalities – spatial locality in images versus semantic density in text. To address this, we propose HeBA (Heterogeneous Bottleneck Adapter), a unified architectural framework that introduces modality-specific structural inductive biases. HeBA departs from conventional designs through three key architectural innovations: (1) Heterogeneity: It processes visual tokens via 2D depthwise-separable convolutions to preserve spatial correlations, while distinctively processing text tokens via dense linear projections to capture semantic relationships; (2) Bottleneck Regularization: Unlike standard expanding adapters, HeBA employs a compression bottleneck (D -> D/4) that explicitly forces the model to learn compact, robust features and acts as a structural regularizer; and (3) Active Gradient Initialization: We challenge the restrictive zero-initialization paradigm, utilizing a Kaiming initialization strategy that ensures sufficient initial gradient flow to accelerate convergence without compromising the frozen backbone’s pre-trained knowledge. Extensive experiments demonstrate that HeBA’s architecturally specialized design achieves superior stability and accuracy, establishing a new state-of-the-art on 11 few-shot benchmarks. Code is available at https://github.com/Jahid12012021/VLM-HeBA.
[93] Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation cs.CV | cs.AIPDF
Jiawei Mao, Hardy Chen, Haoqin Tu, Yuhan Wang, Letian Zhang
TL;DR: Kestrel是一个无需训练的大视觉语言模型幻觉缓解框架,通过结合显式视觉基础代理与证据验证的自优化机制来减少多模态任务中的幻觉问题,在多个基准测试上显著提升性能并提供了可解释的验证轨迹。
Details
Motivation: 大视觉语言模型在多模态任务中易产生幻觉,限制了其实际部署;由于训练成本高昂,需要一种无需训练、高效且可解释的幻觉缓解方法。
Result: 在幻觉基准测试中,Kestrel在POPE上平均提升3.31%,在MME-Hallucination上提升28.34分(使用Qwen3-VL模型),超越了现有基线方法,自优化模块和基础代理分别贡献了约2.0%的性能增益。
Insight: 创新点在于将显式视觉证据收集与结构化文本证据转换相结合,并通过LVLM法官进行证据验证和迭代自优化,以降低过校正风险,同时提供透明的诊断轨迹,增强了方法的可解释性和灵活性。
Abstract: Large vision-language models (LVLMs) have become increasingly strong but remain prone to hallucinations in multimodal tasks, which significantly narrows their deployment. As training these LVLMs to avoid hallucinations becomes prohibitively expensive for larger models, training-free methods offer a cheap and flexible solution to this problem, yet existing approaches based on decoding or tool use often bring limited gains and/or weak interpretability. We propose Kestrel, a training-free framework for LVLM hallucination mitigation that combines an explicit visual-grounding agent with evidence-verified self-refinement mechanism. In detail, Kestrel first collects explicit visual evidence and converts tool outputs into reusable and structured textual evidence. Second, to take full advantage of these evidence, Kestrel verifies them via an LVLM judge for evidence checking, then iteratively self-refine answers based on verified evidence to reduce the risk of over-correction. Extensive experiments show that Kestrel improves performance over strong baselines across hallucination benchmarks (e.g., average +3.31% on POPE and +28.34 on MME-Hallucination with Qwen3-VL), while providing transparent verification traces for hallucination diagnosis and analysis – e.g., both the integrated self-refinement module and grounding agent contributing an average +2.0% gain on POPE.
[94] Fast-WAM: Do World Action Models Need Test-time Future Imagination? cs.CV | cs.AIPDF
Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao
TL;DR: 本文提出Fast-WAM,一种世界动作模型(WAM)架构,它在训练时保留视频协同训练,但在测试时跳过显式的未来预测。研究发现,这种无需测试时未来想象的模型在仿真基准(如LIBERO和RoboTwin)和真实世界任务中仍能达到与最先进方法相当的性能,且推理延迟大幅降低(190毫秒,比现有’想象-执行’范式WAM快4倍以上)。
Details
Motivation: 现有WAM大多遵循’想象-执行’范式,在测试时通过迭代视频去噪进行显式未来想象,导致显著延迟。本文旨在探究WAM的性能提升主要源于训练时的视频建模,还是测试时的显式未来生成,从而设计更高效的模型。
Result: 在仿真基准LIBERO和RoboTwin以及真实世界任务上,Fast-WAM取得了与最先进方法(SOTA)相当的结果,且无需具身预训练。移除视频协同训练会导致性能大幅下降,而跳过测试时未来想象则影响较小。
Insight: 论文的核心创新在于解耦了训练时的视频建模与测试时的未来生成,表明WAM的主要价值可能在于通过视频预测提升训练时的世界表征,而非在测试时生成未来观测。这为设计低延迟、高性能的具身智能模型提供了新思路。
Abstract: World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing \textbf{Fast-WAM}, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4$\times$ faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/
[95] $x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space cs.CVPDF
Ruishan Guo, Ciyu Ruan, Haoyang Wang, Zihang Gong, Jingao Xu
TL;DR: 本文提出$x^2$-Fusion方法,通过将事件相机提供的时空边缘信号作为统一的潜在表示空间(Event Edge Space),实现图像、LiDAR和事件数据的跨模态与跨维度特征对齐与融合,以联合估计密集的2D光流和3D场景流。
Details
Motivation: 现有方法在结合图像、LiDAR和事件数据预测2D和3D运动时,通常在分离的异构特征空间中操作,缺乏共享的潜在空间,导致跨传感器不匹配问题未解决且融合过程复杂。
Result: 在合成和真实基准测试上的大量实验表明,$x^2$-Fusion在标准条件下达到了最先进的精度,并在具有挑战性的场景中带来了显著改进。
Insight: 创新点在于将多模态融合重构为表示统一,利用事件衍生的时空边缘定义以边缘为中心的同质空间,并在此共享表示中显式对齐图像和LiDAR特征;同时引入可靠性感知的自适应融合和跨维度对比学习来紧密耦合2D光流与3D场景流。
Abstract: Estimating dense 2D optical flow and 3D scene flow is essential for dynamic scene understanding. Recent work combines images, LiDAR, and event data to jointly predict 2D and 3D motion, yet most approaches operate in separate heterogeneous feature spaces. Without a shared latent space that all modalities can align to, these systems rely on multiple modality-specific blocks, leaving cross-sensor mismatches unresolved and making fusion unnecessarily complex.Event cameras naturally provide a spatiotemporal edge signal, which we can treat as an intrinsic edge field to anchor a unified latent representation, termed the Event Edge Space. Building on this idea, we introduce $x^2$-Fusion, which reframes multimodal fusion as representation unification: event-derived spatiotemporal edges define an edge-centric homogeneous space, and image and LiDAR features are explicitly aligned in this shared representation.Within this space, we perform reliability-aware adaptive fusion to estimate modality reliability and emphasize stable cues under degradation. We further employ cross-dimension contrast learning to tightly couple 2D optical flow with 3D scene flow. Extensive experiments on both synthetic and real benchmarks show that $x^2$-Fusion achieves state-of-the-art accuracy under standard conditions and delivers substantial improvements in challenging scenarios.
[96] Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search cs.CVPDF
Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi
TL;DR: Search2Motion是一个无需训练的图像到视频生成框架,用于对象级运动编辑。它通过基于目标帧的控制和利用首尾帧运动先验,实现对象重定位并保持场景稳定,无需微调。该框架通过语义引导的对象插入和鲁棒的背景修复来构建可靠的目标帧,并提出一种轻量级的注意力共识搜索策略(ACE-Seed)来提升运动保真度。
Details
Motivation: 解决现有方法在对象级运动编辑中需要轨迹、边界框、掩码或运动场等额外输入的问题,旨在实现更灵活、无需训练的控制,同时分离对象和相机运动以进行更准确的评估。
Result: 在提出的FLF2V-obj和VBench基准测试中,Search2Motion consistently outperforms baselines,达到了SOTA水平,特别是在稳定相机、仅对象运动的评估场景下表现优异。
Insight: 创新点包括:1) 基于目标帧的控制方法,利用首尾帧运动先验;2) 早期步自注意力图预测对象和相机动态,提供可解释的用户反馈;3) ACE-Seed轻量搜索策略,无需前瞻采样或外部评估器;4) 引入S2M-DAVIS和S2M-OMB新基准,以及FLF2V-obj指标,用于隔离对象伪影。
Abstract: We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.
[97] Emotion-Aware Classroom Quality Assessment Leveraging IoT-Based Real-Time Student Monitoring cs.CVPDF
Hai Nguyen, Hieu Dao, Hung Nguyen, Nam Vu, Cong Tran
TL;DR: 本研究提出了一种基于物联网的高通量、实时多智能体情感计算框架,用于通过监测学生情绪状态来提升课堂学习质量。该系统针对物联网设备优化,解决了负载均衡和延迟问题,并在三所教育机构进行了实地测试,实现了最高50张人脸、25 FPS的检测速度,以及88%的课堂参与状态分类准确率。
Details
Motivation: 随着课堂规模扩大和师生互动有限,教育工作者迫切需要可扩展、数据驱动的工具来实时捕捉学生的情绪和参与模式,以改善教学效果。
Result: 在包含1,500张标注图像和300个课堂检测视频的Classroom Emotion Dataset上评估,系统在实地测试中达到88%的整体准确率,检测速度达25 FPS,并获得了学生、教师和家长的积极反馈。
Insight: 创新点包括建立了一个实用的、基于物联网的情绪感知学习环境框架,并引入了’Classroom Emotion Dataset’数据集以促进进一步验证和研究;从客观角度看,该研究将多智能体情感计算与物联网实时处理结合,针对教育场景的负载和延迟挑战提供了优化方案。
Abstract: This study presents high-throughput, real-time multi-agent affective computing framework designed to enhance classroom learning through emotional state monitoring. As large classroom sizes and limited teacher student interaction increasingly challenge educators, there is a growing need for scalable, data-driven tools capable of capturing students’ emotional and engagement patterns in real time. The system was evaluated using the Classroom Emotion Dataset, consisting of 1,500 labeled images and 300 classroom detection videos. Tailored for IoT devices, the system addresses load balancing and latency challenges through efficient real-time processing. Field testing was conducted across three educational institutions in a large metropolitan area: a primary school (hereafter school A), a secondary school (school B), and a high school (school C). The system demonstrated robust performance, detecting up to 50 faces at 25 FPS and achieving 88% overall accuracy in classifying classroom engagement states. Implementation results showed positive outcomes, with favorable feedback from students, teachers, and parents regarding improved classroom interaction and teaching adaptation. Key contributions of this research include establishing a practical, IoT-based framework for emotion-aware learning environments and introducing the ‘Classroom Emotion Dataset’ to facilitate further validation and research.
[98] World Reconstruction From Inconsistent Views cs.CVPDF
Lukas Höllein, Matthias Nießner
TL;DR: 本文提出了一种从视频扩散模型生成的不一致视图中重建高质量3D场景的方法。该方法首先利用几何基础模型将视频帧提升为像素级点云,然后通过非刚性迭代帧到模型ICP进行初始对齐和全局优化,最后提出逆向变形渲染损失来生成可探索的3D环境。
Details
Motivation: 解决视频扩散模型生成的帧之间缺乏3D一致性的问题,使得从这些不一致的视图中重建3D世界变得困难。
Result: 在3D场景重建质量上优于基线方法,成功将视频模型转化为3D一致的世界生成器。
Insight: 创新点包括非刚性迭代帧到模型ICP对齐方法、全局优化点云锐化技术以及逆向变形渲染损失函数,有效处理了视频帧间的非刚性变形和不一致性,实现了从2D视频到高质量3D场景的转换。
Abstract: Video diffusion models generate high-quality and diverse worlds; however, individual frames often lack 3D consistency across the output sequence, which makes the reconstruction of 3D worlds difficult. To this end, we propose a new method that handles these inconsistencies by non-rigidly aligning the video frames into a globally-consistent coordinate frame that produces sharp and detailed pointcloud reconstructions. First, a geometric foundation model lifts each frame into a pixel-wise 3D pointcloud, which contains unaligned surfaces due to these inconsistencies. We then propose a tailored non-rigid iterative frame-to-model ICP to obtain an initial alignment across all frames, followed by a global optimization that further sharpens the pointcloud. Finally, we leverage this pointcloud as initialization for 3D reconstruction and propose a novel inverse deformation rendering loss to create high quality and explorable 3D environments from inconsistent views. We demonstrate that our 3D scenes achieve higher quality than baselines, effectively turning video models into 3D-consistent world generators.
[99] When the City Teaches the Car: Label-Free 3D Perception from Infrastructure cs.CVPDF
Zhen Xu, Jinsu Yoo, Cristian Bautista, Zanming Huang, Tai-Yu Pan
TL;DR: 本文提出了一种名为‘基础设施教学、无标签3D感知’的新范式,旨在利用路边单元(RSUs)作为静态、无监督的‘教师’,通过观察未标记数据学习局部3D检测器,并将预测广播给过往车辆,作为伪标签监督来训练车辆自身的独立检测器,从而减少自动驾驶对大规模数据标注的依赖。
Details
Motivation: 自动驾驶的鲁棒3D感知严重依赖大规模数据收集和人工标注,这在跨城市和地区部署时变得不切实际。同时,现代城市越来越多地配备了路边单元(RSUs),这引发了一个问题:城市本身能否帮助训练车辆?
Result: 在基于CARLA的多智能体环境中进行概念可行性研究,使用CenterPoint检测器,该管道在车辆检测上达到了82.3%的平均精度(AP),而完全监督的车辆检测器上限为94.4%。论文还系统分析了各阶段,评估了其可扩展性,并展示了与现有以车辆为中心的无标签方法的互补性。
Insight: 核心创新点在于利用城市基础设施(RSUs)作为无监督教师,为车辆提供伪标签监督,从而构建一个无需测试时基础设施或通信的独立检测模型。这为降低3D感知的标注成本提供了一个有前景的、正交的新范式,即利用静态传感器的固定视角和重复观察来生成监督信号。
Abstract: Building robust 3D perception for self-driving still relies heavily on large-scale data collection and manual annotation, yet this paradigm becomes impractical as deployment expands across diverse cities and regions. Meanwhile, modern cities are increasingly instrumented with roadside units (RSUs), static sensors deployed along roads and at intersections to monitor traffic. This raises a natural question: can the city itself help train the vehicle? We propose infrastructure-taught, label-free 3D perception, a paradigm in which RSUs act as stationary, unsupervised teachers for ego vehicles. Leveraging their fixed viewpoints and repeated observations, RSUs learn local 3D detectors from unlabeled data and broadcast predictions to passing vehicles, which are aggregated as pseudo-label supervision for training a standalone ego detector. The resulting model requires no infrastructure or communication at test time. We instantiate this idea as a fully label-free three-stage pipeline and conduct a concept-and-feasibility study in a CARLA-based multi-agent environment. With CenterPoint, our pipeline achieves 82.3% AP for detecting vehicles, compared to a fully supervised ego upper bound of 94.4%. We further systematically analyze each stage, evaluate its scalability, and demonstrate complementarity with existing ego-centric label-free methods. Together, these results suggest that city infrastructure itself can potentially provide a scalable supervisory signal for autonomous vehicles, positioning infrastructure-taught learning as a promising orthogonal paradigm for reducing annotation cost in 3D perception.
[100] GDPO-SR: Group Direct Preference Optimization for One-Step Generative Image Super-Resolution cs.CVPDF
Qiaosi Yi, Shuai Li, Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang
TL;DR: 本文提出了一种名为GDPO-SR的新方法,用于单步生成式图像超分辨率(ISR)模型的训练。该方法结合了噪声感知的单步扩散模型和一种新颖的群组直接偏好优化(GDPO)策略,以解决现有强化学习方法在单步生成ISR中样本有限和忽略局部细节的问题。
Details
Motivation: 当前强化学习方法主要关注多步生成式图像超分辨率,而单步生成由于随机性有限而研究不足。现有方法如DPO需要离线生成正负样本对导致样本有限,而GRPO仅计算整图似然而忽略对ISR至关重要的局部细节。
Result: 实验证明了GDPO在提升单步生成式ISR模型性能方面的有效性。
Insight: 主要创新点包括:1. 引入噪声感知的单步扩散模型,通过不等时间步策略解耦噪声添加和扩散的时间步,以生成多样化的ISR输出;2. 提出GDPO策略,将GRPO的原理集成到DPO中,计算在线生成样本的群组相对优势进行模型优化;3. 设计属性感知的奖励函数,基于平滑和纹理区域的统计信息动态评估每个样本的得分。
Abstract: Recently, reinforcement learning (RL) has been employed for improving generative image super-resolution (ISR) performance. However, the current efforts are focused on multi-step generative ISR, while one-step generative ISR remains underexplored due to its limited stochasticity. In addition, RL methods such as Direct Preference Optimization (DPO) require the generation of positive and negative sample pairs offline, leading to a limited number of samples, while Group Relative Policy Optimization (GRPO) only calculates the likelihood of the entire image, ignoring local details that are crucial for ISR. In this paper, we propose Group Direct Preference Optimization (GDPO), a novel approach to integrate RL into one-step generative ISR model training. First, we introduce a noise-aware one-step diffusion model that can generate diverse ISR outputs. To prevent performance degradation caused by noise injection, we introduce an unequal-timestep strategy to decouple the timestep of noise addition from that of diffusion. We then present the GDPO strategy, which integrates the principle of GRPO into DPO, to calculate the group-relative advantage of each online generated sample for model optimization. Meanwhile, an attribute-aware reward function is designed to dynamically evaluate the score of each sample based on its statistics of smooth and texture areas. Experiments demonstrate the effectiveness of GDPO in enhancing the performance of one-step generative ISR models. Code: https://github.com/Joyies/GDPO.
[101] IOSVLM: A 3D Vision-Language Model for Unified Dental Diagnosis from Intraoral Scans cs.CV | cs.AIPDF
Huimin Xiong, Zijie Meng, Tianxiang Hu, Chenyi Zhou, Yang Feng
TL;DR: 本文提出了IOSVLM,一个用于口腔内扫描(IOS)统一诊断的端到端3D视觉-语言模型。该模型将扫描数据表示为点云,采用3D编码器-投影器-LLM架构,能够进行统一诊断和生成式视觉问答(VQA)。同时,作者构建了大规模数据集IOSVQA,包含超过19,000个病例和249,055个VQA对,涵盖23种口腔疾病和多种扫描类型。
Details
Motivation: 解决现有基于2D图像或多视图渲染的口腔视觉-语言模型未能充分利用原生3D几何信息的问题,以应对口腔扫描数据形式异构、拓扑复杂、多疾病共现且类别不平衡、以及3D数据-文本配对数据有限等挑战。
Result: 在IOSVQA数据集上,IOSVLM模型持续优于强基线模型,在宏准确率上至少提升+9.58%,在宏F1分数上至少提升+1.46%,表明直接3D几何建模对基于IOS的诊断是有效的。
Insight: 主要创新点包括:1)首个用于IOS统一诊断和生成式VQA的端到端3D视觉-语言模型;2)提出了一个几何到色彩的代理方法,以弥合无色IOS数据与依赖色彩的3D预训练之间的分布差距,稳定细粒度几何感知和跨模态对齐;3)采用两阶段课程学习策略增强模型鲁棒性;4)构建了大规模、多源、涵盖多种疾病和扫描类型的IOS诊断VQA数据集IOSVQA。
Abstract: 3D intraoral scans (IOS) are increasingly adopted in routine dentistry due to abundant geometric evidence, and unified multi-disease diagnosis is desirable for clinical documentation and communication. While recent works introduce dental vision-language models (VLMs) to enable unified diagnosis and report generation on 2D images or multi-view images rendered from IOS, they do not fully leverage native 3D geometry. Such work is necessary and also challenging, due to: (i) heterogeneous scan forms and the complex IOS topology, (ii) multi-disease co-occurrence with class imbalance and fine-grained morphological ambiguity, (iii) limited paired 3D IOS-text data. Thus, we present IOSVLM, an end-to-end 3D VLM that represents scans as point clouds and follows a 3D encoder-projector-LLM design for unified diagnosis and generative visual question-answering (VQA), together with IOSVQA, a large-scale multi-source IOS diagnosis VQA dataset comprising 19,002 cases and 249,055 VQA pairs over 23 oral diseases and heterogeneous scan types. To address the distribution gap between color-free IOS data and color-dependent 3D pre-training, we propose a geometry-to-chromatic proxy that stabilizes fine-grained geometric perception and cross-modal alignment. A two-stage curriculum training strategy further enhances robustness. IOSVLM consistently outperforms strong baselines, achieving gains of at least +9.58% macro accuracy and +1.46% macro F1, indicating the effectiveness of direct 3D geometry modeling for IOS-based diagnosis.
[102] V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising cs.CV | cs.AIPDF
Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang
TL;DR: 本文提出了V-Co,一个基于即时(JiT)框架的系统性视觉协同去噪研究,旨在探究如何有效利用预训练视觉特征来增强像素空间扩散模型。研究发现并验证了四个关键设计要素:完全双流架构、结构化的无条件预测、感知漂移混合损失以及基于RMS的特征重缩放。在ImageNet-256上的实验表明,V-Co在模型大小相当的情况下,优于基线像素空间扩散模型和先前的强方法,且训练轮次更少。
Details
Motivation: 现有视觉协同去噪方法往往将多个设计选择纠缠在一起,导致难以确定哪些设计真正关键。本文旨在通过一个统一的框架,系统性地研究视觉协同去噪的有效成分,为未来基于表示对齐的生成模型提供实用指导。
Result: 在ImageNet-256基准测试中,V-Co在模型规模可比的情况下,超越了其基础的像素空间扩散基线以及先前强大的像素扩散方法,同时使用了更少的训练轮次。
Insight: 本文的核心创新在于通过系统性实验,分离并验证了视觉协同去噪有效的四个关键设计要素:完全双流架构、结构化的无条件预测以实现有效的无分类器指导、感知漂移混合损失以提供更强的语义监督,以及基于RMS的跨流校准以确保稳定训练。这为构建更高效的表示对齐生成模型提供了清晰的配方。
Abstract: Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
[103] WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation cs.CV | cs.DLPDF
Muhammad Aamir, Naoya Muramatsu, Sangyun Shin, Matthew Wijers, Jiaxing Jhong
TL;DR: 本文介绍了WildDepth,一个用于3D野生动物感知和深度估计的多模态数据集,包含从家养到野外环境中多种动物的同步RGB和LiDAR数据,旨在支持深度估计、行为检测和3D重建任务。
Details
Motivation: 现有动物相关模型大多基于无度量尺度的数据集训练,限制了图像模型的验证,WildDepth旨在通过提供带度量尺度的多模态数据来解决这一局限性。
Result: 实验结果表明,使用多模态数据可将深度估计的RMSE提升高达10%,RGB-LiDAR融合使3D重建的Chamfer距离改善12%,在WildDepth基准上验证了性能提升。
Insight: 创新点在于构建了首个涵盖广泛动物类别、具有同步RGB-LiDAR和度量尺度的多模态数据集,为跨领域鲁棒感知系统提供了基准,强调了多模态融合在提升深度估计和3D重建精度方面的价值。
Abstract: Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, the research has expanded to address general objects, including challenging deformable objects, such as humans and animals. However, for the animal, in particular, the majority of existing models are trained based on datasets without metric scale, which can help validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction from diverse categories of animals ranging from domestic to wild environments with synchronized RGB and LiDAR. Experimental results show that the use of multi-modal data improves depth reliability by up to 10% RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.
[104] Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines cs.CVPDF
Sourya Saha, Saptarshi Debroy
TL;DR: 本文提出了一种基于深度强化学习的电池感知执行管理框架,用于边缘辅助的XR系统,旨在联合优化执行放置、工作负载质量、延迟要求和电池动态,以在满足严格延迟约束的同时,显著延长设备电池寿命。
Details
Motivation: 现有自适应执行和计算卸载方法通常优化平均性能指标,未能充分捕捉闭环XR工作负载中实时延迟要求与设备电池寿命之间的持续交互,因此需要一种能同时考虑延迟和能耗的联合优化框架。
Result: 实验结果表明,与延迟最优的本地执行相比,所提方法在稳定网络条件下将预计设备电池寿命延长了高达163%,同时保持超过90%的运动到光子延迟合规性;即使在网络带宽严重受限的情况下,合规性也不低于80%。
Insight: 创新点在于设计了一个基于轻量级深度强化学习的在线决策机制,能够动态适应网络条件并维持高延迟合规性,明确管理了沉浸式XR系统中的延迟-能量权衡,为实时关键工作负载的边缘卸载提供了新思路。
Abstract: Immersive extended reality (XR) applications introduce latency-critical workloads that must satisfy stringent real-time responsiveness while operating on energy- and battery-constrained devices, making execution placement between end devices and nearby edge servers a fundamental systems challenge. Existing approaches to adaptive execution and computation offloading typically optimize average performance metrics and do not fully capture the sustained interaction between real-time latency requirements and device battery lifetime in closed-loop XR workloads. In this paper, we present a battery-aware execution management framework for edge-assisted XR systems that jointly considers execution placement, workload quality, latency requirements, and battery dynamics. We design an online decision mechanism based on a lightweight deep reinforcement learning policy that continuously adapts execution decisions under dynamic network conditions while maintaining high motion-to-photon latency compliance. Experimental results show that the proposed approach extends the projected device battery lifetime by up to 163% compared to latency-optimal local execution while maintaining over 90% motion-to-photon latency compliance under stable network conditions. Such compliance does not fall below 80% even under significantly limited network bandwidth availability, thereby demonstrating the effectiveness of explicitly managing latency-energy trade-offs in immersive XR systems.
[105] What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers cs.CV | cond-mat.mtrl-sciPDF
Moritz Pawlowsky, Antonis Vamvakeros, Alexander Weiss, Anja Bielefeld, Samuel J. Cooper
TL;DR: 该论文研究了视觉变换器(ViT)中因位置编码选择导致的位置偏差问题,特别是在材料科学等领域的零样本适应中,图像通常为均匀微结构的横截面(无方向偏好)。作者通过线性探测发现多种目标和位置编码均存在位置偏差,并通过微调模型使用ALiBi相对位置编码来减少这种偏差,同时保持模型的一般语义能力,最终在复杂显微镜图像的可训练分割中成功应用了无偏特征。
Details
Motivation: 解决视觉变换器(尤其是像DINOv2这样的特征基础模型)因位置编码等架构选择导致的位置偏差和伪影问题,这些偏差独立于语义内容,使得在材料科学等领域的零样本适应变得困难。
Result: 通过线性探测验证了位置偏差存在于多种目标和位置编码中,微调使用ALiBi位置编码的模型后,位置偏差减少,同时模型保留了良好的通用语义,其无偏特征在复杂显微镜图像的可训练分割中表现成功。
Insight: 创新点在于系统性地分析ViT的位置偏差,并提出通过微调采用ALiBi相对位置编码来减少偏差,同时保持语义表示质量,这为在无方向偏好图像(如材料科学中的均匀微结构)中应用ViT提供了改进方法。
Abstract: Vision transformers (ViTs) - especially feature foundation models like DINOv2 - learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead to these models displaying positional biases and artefacts independent of semantic content. This makes zero-shot adaption difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by finetuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and their unbiased features can be used successfully in trainable segmentation of complex microscopy images.
[106] M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM cs.CVPDF
Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu
TL;DR: 本文提出M^3方法,通过为多视图基础模型添加专用的匹配头来获取细粒度密集对应关系,并将其集成到鲁棒的单目高斯溅射SLAM系统中,以解决未标定单目视频流在线重建中姿态估计精度不足和动态环境高效优化的问题。
Details
Motivation: 解决未标定单目视频流在线重建的挑战,即需要高精度姿态估计和动态环境中计算高效的在线优化,而现有结合3D基础模型与SLAM的方法因前馈式姿态估计产生的像素级对应关系缺乏几何优化所需精度。
Result: 在多种室内外基准测试上达到最先进的姿态估计和场景重建精度,在ScanNet++数据集上比VGGT-SLAM 2.0降低ATE RMSE 64.3%,比ARTDECO在PSNR上提升2.11 dB。
Insight: 创新点在于为多视图基础模型设计专用匹配头以获取细粒度密集对应,并集成动态区域抑制和交叉推理内参对齐机制提升跟踪稳定性,实现了基础模型几何感知能力与SLAM优化框架的有效结合。
Abstract: Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
[107] SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation cs.CV | cs.AIPDF
Jiongze Yu, Xiangbo Gao, Pooja Verlani, Akshay Gadde, Yilin Wang
TL;DR: SparkVSR是一个交互式视频超分辨率框架,允许用户通过稀疏关键帧作为控制信号来引导超分过程。它首先使用图像超分模型处理少量关键帧,然后通过一个两阶段训练管道将关键帧的先验信息传播到整个视频序列,同时保持与原始低分辨率视频运动的一致性。
Details
Motivation: 解决现有视频超分辨率方法在推理时像黑盒一样,用户无法可靠地纠正意外伪影,只能被动接受模型输出结果的问题。
Result: 在多个VSR基准测试中,SparkVSR在时间一致性和恢复质量上表现出色,在CLIP-IQA、DOVER和MUSIQ指标上分别超过基线模型达24.6%、21.8%和5.6%。
Insight: 创新点在于将稀疏关键帧作为一种简单而富有表现力的交互式控制信号,并设计了一个关键帧条件化的潜空间-像素两阶段训练管道,实现了可控、关键帧驱动的视频超分,且该框架具有通用性,可应用于老电影修复和视频风格迁移等未见任务。
Abstract: Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) estimates, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts, but instead can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework dubbed SparkVSR that makes sparse keyframes a simple and expressive control signal. Specifically, users can first super-resolve or optionally a small set of keyframes using any off-the-shelf image super-resolution (ISR) model, then SparkVSR propagates the keyframe priors to the entire video sequence while remaining grounded by the original LR video motion. Concretely, we introduce a keyframe-conditioned latent-pixel two-stage training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence and blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, we demonstrate that SparkVSR is a generic interactive, keyframe-conditioned video processing framework as it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer. Our project page is available at: https://sparkvsr.github.io/
[108] SegviGen: Repurposing 3D Generative Model for Part Segmentation cs.CVPDF
Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie
TL;DR: SegviGen是一个利用预训练3D生成模型进行3D部件分割的框架。它通过为几何对齐重建中的活跃体素预测部件指示性颜色,将生成模型中的结构化先验知识重新用于分割任务,从而避免了传统方法中存在的多视图不一致、边界模糊以及大规模标注数据需求等问题。
Details
Motivation: 现有3D部件分割方法要么通过蒸馏或多视图掩码聚合将2D先验提升到3D,存在跨视图不一致和边界模糊问题;要么探索原生3D判别式分割,但需要大规模标注3D数据和大量训练资源。SegviGen旨在利用预训练3D生成模型中编码的结构化先验,以更高效、数据需求更少的方式实现3D部件分割。
Result: 在交互式部件分割任务上,SegviGen比之前的最先进方法提升了40%;在全分割任务上提升了15%,同时仅使用了0.32%的标注训练数据。
Insight: 核心创新点在于重新利用预训练3D生成模型作为强大的结构化先验,通过部件着色诱导分割,实现了数据高效且性能优越的分割。这证明了3D生成先验可以有效地迁移到3D分割任务,在有限监督下实现强大性能。框架统一支持交互式分割、全分割以及带2D引导的全分割。
Abstract: We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.
[109] Demystifing Video Reasoning cs.CV | cs.AIPDF
Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin
TL;DR: 本文挑战了先前关于视频生成模型中推理机制的观点,揭示了推理主要沿扩散去噪步骤而非视频帧序列展开,并提出了Chain-of-Steps(CoS)机制。研究发现模型在早期去噪步骤中探索多个候选解,逐步收敛至最终答案,并识别出工作记忆、自我纠正与增强、先感知后行动等关键推理行为。此外,研究还发现了扩散Transformer中自演化的功能专业化现象,并提出了一种无需训练的集成策略来提升推理性能。
Details
Motivation: 先前研究将视频生成模型的推理能力归因于跨视频帧的Chain-of-Frames(CoF)机制,本文旨在挑战这一假设,探究推理在扩散模型中的真实涌现机制。
Result: 通过定性分析和针对性探测实验,研究揭示了推理沿去噪步骤展开的CoS机制,并发现扩散Transformer中早期层编码感知结构、中间层执行推理、后期层整合表征的功能专业化。提出的基于不同随机种子的潜在轨迹集成策略,作为概念验证展示了推理改进的可能性。
Insight: 创新点在于揭示了扩散视频模型中推理沿去噪步骤(CoS)而非帧序列(CoF)涌现的核心机制,并系统识别了工作记忆、自我纠正、先感知后行动等关键行为,以及扩散Transformer的自演化功能专业化,为利用视频模型作为智能新基质提供了理论基础。
Abstract: Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
[110] WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation cs.CVPDF
Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee
TL;DR: 该论文提出了WorldCam,一种基于相机姿态作为统一几何表示的交互式自回归3D游戏世界生成模型。通过将用户动作映射为精确的6自由度相机姿态,并将其作为空间索引来检索历史观测,该方法在动作控制精度、长时程视觉质量和3D空间一致性方面显著优于现有方法。
Details
Motivation: 现有视频扩散Transformer方法在构建交互式游戏世界时,难以实现精确的动作控制和长时程3D一致性,主要原因是它们将用户动作视为抽象条件信号,忽略了动作与3D世界之间基本的几何耦合关系。
Result: 在引入的大规模数据集(包含3000分钟带相机轨迹和文本标注的真实人类游戏录像)上进行的大量实验表明,该方法在动作可控性、长时程视觉质量和3D空间一致性方面大幅超越了最先进的交互式游戏世界模型(SOTA)。
Insight: 创新点在于将相机姿态确立为统一几何表示,以同时支撑即时动作控制和长期3D一致性。具体包括:1) 基于物理定义连续动作空间,将用户输入表示为李代数以推导精确6-DoF相机姿态,并通过相机嵌入器注入生成模型确保动作对齐;2) 将全局相机姿态用作空间索引来检索相关历史观测,实现长时程导航中几何一致的位置重访。
Abstract: Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
cs.RO [Back]
[111] The Era of End-to-End Autonomy: Transitioning from Rule-Based Driving to Large Driving Models cs.RO | cs.CV | eess.IVPDF
Eduardo Nebot, Julie Stephany Berrio Perez
TL;DR: 本文探讨了自动驾驶从基于规则的模块化系统向端到端学习系统的转变,分析了特斯拉FSD V12/V14、Rivian统一智能平台、NVIDIA Cosmos等大型驾驶模型的发展,并讨论了从2026年开始部署的监督式端到端驾驶系统(如FSD Supervised)的商业化趋势及其在复杂环境中处理长尾场景的能力。
Details
Motivation: 自动驾驶领域正从传统的感知-规划-控制模块化架构转向能够直接从原始传感器输入映射到驾驶动作的端到端学习系统,本文旨在分析这一技术转型的演进路径、关键系统设计及行业影响。
Result: 早期运营证据表明,端到端学习系统能够有效处理现实驾驶场景中的长尾分布问题,并正成为主导的商业策略,多家制造商计划从2026年起部署监督式端到端驾驶系统(L2++级别)。
Insight: 论文指出端到端学习通过大型驾驶模型直接处理传感器到动作的映射,提升了系统对复杂场景的适应能力;其创新在于将驾驶任务整合为单一学习模型,并可能将类似架构扩展至人形机器人等其他具身AI系统。
Abstract: Autonomous driving is undergoing a shift from modular rule based pipelines toward end to end (E2E) learning systems. This paper examines this transition by tracing the evolution from classical sense perceive plan control architectures to large driving models (LDMs) capable of mapping raw sensor input directly to driving actions. We analyze recent developments including Tesla’s Full Self Driving (FSD) V12 V14, Rivian’s Unified Intelligence platform, NVIDIA Cosmos, and emerging commercial robotaxi deployments, focusing on architectural design, deployment strategies, safety considerations and industry implications. A key emerging product category is supervised E2E driving, often referred to as FSD (Supervised) or L2 plus plus, which several manufacturers plan to deploy from 2026 onwards. These systems can perform most of the Dynamic Driving Task (DDT) in complex environments while requiring human supervision, shifting the driver’s role to safety oversight. Early operational evidence suggests E2E learning handles the long tail distribution of real world driving scenarios and is becoming a dominant commercial strategy. We also discuss how similar architectural advances may extend beyond autonomous vehicles (AV) to other embodied AI systems, including humanoid robotics.
[112] Towards the Vision-Sound-Language-Action Paradigm: The HEAR Framework for Sound-Centric Manipulation cs.RO | cs.AI | cs.CV | cs.SDPDF
Chang Nie, Tianchen Deng, Guangming Wang, Zhe Liu, Hesheng Wang
TL;DR: 本文提出了一种新的多模态控制范式——视觉-声音-语言-动作(VSLA),并引入了HEAR框架,旨在解决现有视觉-语言-动作(VLA)模型在处理实时、以声音为中心的操作任务时存在的不足,特别是对短暂环境声学信号的连续感知和利用问题。
Details
Motivation: 现有VLA模型通常将声音视为静态的预执行提示或仅关注人类语音,忽略了在任务执行过程中,短暂的环境声音对于状态验证的关键作用,且由于低频更新、系统延迟和开环执行中的动作分块,导致关键声音信号容易丢失,形成“盲执行区间”。
Result: 在构建的专有基准HEAR-Bench(首个具有严格因果时序规则的以声音为中心的操作基准)上,HEAR框架展示了其有效性,结果表明,鲁棒的以声音为中心的操作需要因果持久性和显式的时间动态学习。
Insight: 论文的主要创新点在于:1)形式化了VSLA这一新的连续控制范式;2)提出了HEAR框架,其核心组件(流式历史记录器、多感官推理器、音频世界模型预测器和流匹配策略)分别解决了音频上下文维护、多模态推理、时间动态学习和平滑动作生成问题;3)构建了用于预训练和评估的数据集与基准,填补了该领域的数据空白。从客观角度看,将声音作为实时、连续的感知流进行整合,并强调其因果时序特性,是迈向具身智能多感官基础模型的重要一步。
Abstract: While recent Vision-Language-Action (VLA) models have begun to incorporate audio, they typically treat sound as static pre-execution prompts or focus exclusively on human speech. This leaves a significant gap in real-time, sound-centric manipulation where fleeting environmental acoustics provide critical state verification during task execution. Consequently, key sounds are easily missed due to low-frequency updates or system latency. This problem is exacerbated by action chunking with open-loop execution, which creates a Blind Execution Interval where acoustic events are lost between discrete audio observation windows. Recognizing the necessity of continuous auditory awareness, we formalize Vision-Sound-Language-Action (VSLA) as a continuous control paradigm conditioned on vision, streaming audio, language, and proprioception under delayed decision loops. As an instantiation, we introduce HEAR, a VSLA framework integrating four components: (i) a streaming Historizer to maintain a compact, causal audio context across execution gaps; (ii) an Envisioner adapted from omni foundation models to reason over multi-sensory inputs; (iii) an Advancer, formulated as an audio world model, to learn temporal dynamics by predicting near-future audio codes; and (iv) a flow-matching Realizer policy to generate smooth action chunks. To address the scarcity of pretraining data and evaluations for VSLA, we construct OpenX-Sound for pretraining, alongside HEAR-Bench, the first sound-centric manipulation benchmark with strict causal timing rules. Our results suggest that robust sound-centric manipulation necessitates causal persistence and explicit temporal learning. This framework provides a practical step toward multi-sensory foundation models for embodied agents, enabling robots to perceive and interact with dynamic environments. Code and videos are available at https://hear.irmv.top.
[113] Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation cs.RO | cs.CVPDF
Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han
TL;DR: 本文提出了Kinema4D,一种新的动作条件4D生成式机器人模拟器,旨在恢复机器人-世界交互的4D时空本质。它将交互解耦为精确的4D机器人控制轨迹和生成式的环境反应4D建模,并利用一个名为Robo4D-200k的大规模数据集进行训练。实验表明,该方法能有效模拟物理合理、几何一致且与具体机器人形态无关的交互,并首次展示了潜在的零样本迁移能力。
Details
Motivation: 现有基于视频生成的模拟方法主要在2D空间运行或受静态环境线索引导,忽略了机器人-世界交互本质上是需要精确交互建模的4D时空事件。本文旨在恢复这一4D本质,同时确保精确的机器人控制。
Result: 广泛的实验表明,该方法能有效模拟物理合理、几何一致且与具体机器人形态无关的交互,并首次展示了潜在的零样本迁移能力,为推进下一代具身模拟提供了高保真基础。
Insight: 创新点在于将机器人-世界交互解耦为精确的4D机器人控制轨迹(通过运动学驱动URDF模型)和生成式的环境反应4D建模(通过将轨迹投影为点云图作为时空视觉信号来控制生成模型)。这提供了一种新的、更符合物理本质的具身模拟范式,并构建了大规模高质量4D标注数据集以支持训练。
Abstract: Simulating robot-world interactions is a cornerstone of Embodied AI. Recently, a few works have shown promise in leveraging video generations to transcend the rigid visual/physical constraints of traditional simulators. However, they primarily operate in 2D space or are guided by static environmental cues, ignoring the fundamental reality that robot-world interactions are inherently 4D spatiotemporal events that require precise interactive modeling. To restore this 4D essence while ensuring the precise robot control, we introduce Kinema4D, a new action-conditioned 4D generative robotic simulator that disentangles the robot-world interaction into: i) Precise 4D representation of robot controls: we drive a URDF-based 3D robot via kinematics, producing a precise 4D robot control trajectory. ii) Generative 4D modeling of environmental reactions: we project the 4D robot trajectory into a pointmap as a spatiotemporal visual signal, controlling the generative model to synthesize complex environments’ reactive dynamics into synchronized RGB/pointmap sequences. To facilitate training, we curated a large-scale dataset called Robo4D-200k, comprising 201,426 robot interaction episodes with high-quality 4D annotations. Extensive experiments demonstrate that our method effectively simulates physically-plausible, geometry-consistent, and embodiment-agnostic interactions that faithfully mirror diverse real-world dynamics. For the first time, it shows potential zero-shot transfer capability, providing a high-fidelity foundation for advancing next-generation embodied simulation.
[114] vAccSOL: Efficient and Transparent AI Vision Offloading for Mobile Robots cs.RO | cs.CVPDF
Adam Zahir, Michele Gucciardom Falk Selker, Anastasios Nanos, Kostis Papazafeiropoulos, Carlos J. Bernardos
TL;DR: 本文提出了vAccSOL框架,旨在解决移动机器人上运行AI视觉任务时计算资源有限和能耗高的问题。该框架通过SOL神经网络编译器生成优化的推理库,并结合vAccel轻量级执行框架,实现本地或边缘卸载的透明调度,从而提升推理效率并降低能耗。
Details
Motivation: 移动机器人依赖计算机视觉进行感知和决策,但机载计算资源有限且能耗约束严格,现有嵌入式加速器通常绑定专有软件栈,导致用户定义任务在资源受限的伴侣计算机上运行效率低下。
Result: 在真实测试平台上,使用商业四足机器人和12个深度学习模型进行评估。SOL编译器相比PyTorch基线实现了相当或更好的推理性能;通过边缘卸载,vAccSOL将机器人端功耗降低高达80%,边缘端功耗降低高达60%,同时视觉流水线帧率提升高达24倍,延长了电池供电机器人的运行时间。
Insight: 创新点在于结合神经网络编译器与轻量级执行框架,实现硬件优化推理和透明执行位置调度,无需修改机器人应用;客观分析认为,该框架通过异构平台协同和低运行时依赖设计,在能效和性能平衡方面具有借鉴价值。
Abstract: Mobile robots are increasingly deployed for inspection, patrol, and search-and-rescue operations, relying on computer vision for perception, navigation, and autonomous decision-making. However, executing modern vision workloads onboard is challenging due to limited compute resources and strict energy constraints. While some platforms include embedded accelerators, these are typically tied to proprietary software stacks, leaving user-defined workloads to run on resource-constrained companion computers. We present vAccSOL, a framework for efficient and transparent execution of AI-based vision workloads across heterogeneous robotic and edge platforms. vAccSOL integrates two components: SOL, a neural network compiler that generates optimized inference libraries with minimal runtime dependencies, and vAccel, a lightweight execution framework that transparently dispatches inference locally on the robot or to nearby edge infrastructure. This combination enables hardware-optimized inference and flexible execution placement without requiring modifications to robot applications. We evaluate vAccSOL on a real-world testbed with a commercial quadruped robot and twelve deep learning models covering image classification, video classification, and semantic segmentation. Compared to a PyTorch compiler baseline, SOL achieves comparable or better inference performance. With edge offloading, vAccSOL reduces robot-side power consumption by up to 80% and edge-side power by up to 60% compared to PyTorch, while increasing vision pipeline frame rate by up to 24x, extending the operating lifetime of battery-powered robots.
cs.SE [Back]
[115] SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding cs.SE | cs.AI | cs.CLPDF
Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou
TL;DR: 本文提出了SWE-QA-Pro基准测试,用于评估仓库级代码理解能力,该基准通过问题驱动的聚类确保主题平衡并过滤简单问题,以准确衡量智能体工作流的性能。同时,作者设计了一个可扩展的合成数据管道和两阶段训练方法(SFT+RLAIF),使小型开源模型在基准上超越了GPT-4o,并显著缩小了与顶尖专有模型的差距。
Details
Motivation: 当前仓库级代码理解领域缺乏可靠的基准,现有评估常忽略长尾主题并可能让大语言模型通过记忆知识作弊,因此需要构建一个能真实反映智能体探索必要性的评测集。
Result: 在SWE-QA-Pro基准上,智能体工作流显著优于直接回答基线(如Claude Sonnet 4.5有约13分差距)。使用所提训练方法训练的Qwen3-8B模型在该基准上超越了GPT-4o 2.3分,并大幅缩小了与最先进专有模型的差距。
Insight: 创新点在于构建了一个基于多样化、长尾仓库且包含可执行环境的基准,并通过问题驱动聚类和难度校准确保其代表性;同时提出了一个可扩展的合成数据生成与两阶段训练流程,有效教会小型模型使用工具和进行推理,为训练复杂智能体行为提供了新方案。
Abstract: Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
cs.AI [Back]
[116] BrainBench: Exposing the Commonsense Reasoning Gap in Large Language Models cs.AI | cs.CLPDF
Yuzhe Tang
TL;DR: 本文介绍了BrainBench,一个包含100个脑筋急转弯问题的基准测试,涵盖20个精心设计的类别,旨在揭示大型语言模型在常识推理方面的缺陷。评估了Claude和GPT家族的八个前沿模型,发现即使最佳模型Claude Opus 4.6的准确率也仅为80.3%,而最差模型GPT-4o为39.7%。模型在准确性和一致性之间存在差距,且跨语言评估显示性能下降,表明这些失败反映了推理缺陷而非语言特定问题。
Details
Motivation: 尽管大型语言模型在标准基准测试中表现优异,但它们经常在人类能轻松回答的常识推理问题上失败,因此需要一个新的基准来暴露这些推理差距。
Result: 在BrainBench上,最佳模型Claude Opus 4.6(带扩展思考)的准确率为80.3%,最差模型GPT-4o为39.7%;模型在准确性和一致性之间存在6-16个百分点的差距,且中文评估中性能下降2-8个百分点。
Insight: BrainBench提供了一个细粒度的诊断工具,用于识别大型语言模型在哪些方面用表面启发式方法替代真正的常识推理,揭示了模型在物理约束、语义范围技巧和默认假设劫持等类别中的系统性失败,强调了推理一致性和跨语言鲁棒性的重要性。
Abstract: Large language models (LLMs) achieve impressive scores on standard benchmarks yet routinely fail questions that any human would answer correctly in seconds. We introduce BrainBench, a benchmark of 100 brainteaser questions spanning 20 carefully designed categories, each targeting a specific commonsense reasoning failure mode in LLMs. Categories range from implicit physical constraints (“Should I walk or drive my rental car to the return lot?”) to semantic scope tricks and default assumption hijacks. We evaluate eight frontier models – four from the Claude family and four from the GPT family – using a zero-shot protocol with 10 independent runs per question. The best model, Claude Opus 4.6 with extended thinking, achieves only 80.3% accuracy; the worst, GPT-4o, scores 39.7%. Even top-performing models exhibit a 6-16 percentage-point gap between accuracy and consistency, revealing stochastic reasoning. Cross-lingual evaluation in Chinese shows most models degrade by 2-8 percentage points, confirming that these failures reflect reasoning deficits rather than language-specific artifacts. BrainBench provides a fine-grained diagnostic tool for identifying where and why LLMs substitute surface heuristics for genuine commonsense reasoning.
[117] MAC: Multi-Agent Constitution Learning cs.AI | cs.CL | cs.LG | cs.MAPDF
Rushil Thareja, Gautam Gupta, Francesco Pinto, Nils Lukas
TL;DR: 本文提出了一种名为多智能体宪法学习(MAC)的新方法,用于自动学习和管理基于自然语言规则(宪法)的大型语言模型(LLM)。该方法通过一个由专门智能体组成的网络来优化结构化的规则集,解决了现有基于LLM的提示优化方法需要大量标注数据和缺乏结构化导致的性能瓶颈问题。
Details
Motivation: 现有基于LLM的宪法学习方法依赖人工编写规则,而自动学习方法(如提示优化)存在需要大量标注数据、优化提示缺乏结构导致性能提升有限的问题。本文旨在开发一种能自动、高效学习结构化规则集(宪法)的方法。
Result: 在个人可识别信息(PII)标注任务(一个标注数据有限且可解释性至关重要的分类任务)上,MAC的性能比最近的提示优化方法高出50%以上,并能生成人类可读、可审计的规则集。其性能与监督微调(supervised fine-tuning)和GRPO(Group Relative Policy Optimization)相当,且无需更新模型参数。此外,MAC也成功泛化到工具调用等其他智能体任务。
Insight: 核心创新在于将宪法学习构建为一个多智能体优化问题,通过专门智能体(接受、编辑、拒绝)对结构化规则集进行迭代更新。MAC+进一步引入了基于成功轨迹的训练来强化高奖励的更新。这种方法实现了规则集的自动、结构化学习,在保持高性能的同时保证了可解释性和可审计性,且无需模型参数更新,为LLM的监督和控制提供了一种高效的新范式。
Abstract: Constitutional AI is a method to oversee and control LLMs based on a set of rules written in natural language. These rules are typically written by human experts, but could in principle be learned automatically given sufficient training data for the desired behavior. Existing LLM-based prompt optimizers attempt this but are ineffective at learning constitutions since (i) they require many labeled examples and (ii) lack structure in the optimized prompts, leading to diminishing improvements as prompt size grows. To address these limitations, we propose Multi-Agent Constitutional Learning (MAC), which optimizes over structured prompts represented as sets of rules using a network of agents with specialized tasks to accept, edit, or reject rule updates. We also present MAC+, which improves performance by training agents on successful trajectories to reinforce updates leading to higher reward. We evaluate MAC on tagging Personally Identifiable Information (PII), a classification task with limited labels where interpretability is critical, and demonstrate that it generalizes to other agentic tasks such as tool calling. MAC outperforms recent prompt optimization methods by over 50%, produces human-readable and auditable rule sets, and achieves performance comparable to supervised fine-tuning and GRPO without requiring parameter updates.
[118] BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs cs.AI | cs.CLPDF
Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong, Wonje Jeung, Yongil Kim
TL;DR: 本文提出了BenchPreS基准,用于评估基于持久记忆的LLMs在不同沟通情境下能否恰当应用或抑制用户偏好,发现前沿LLMs仍难以实现情境敏感的偏好应用,常将偏好视为全局规则而非情境依赖的规范信号。
Details
Motivation: 解决LLMs在第三方沟通场景中,因社会与制度规范限制,无法情境敏感地应用持久记忆中的用户偏好,可能导致不当应用的问题。
Result: 在BenchPreS基准上,使用误用率(MR)和恰当应用率(AAR)指标评估,前沿LLMs表现不佳;偏好遵循性强的模型过度应用率更高,推理能力或提示防御均未能完全解决该问题。
Insight: 创新点在于引入情境感知的个性化偏好选择性基准,揭示LLMs将偏好视为全局可执行规则的局限性,为改进LLM在个性化与情境适应性间的平衡提供方向。
Abstract: Large language models (LLMs) increasingly store user preferences in persistent memory to support personalization across interactions. However, in third-party communication settings governed by social and institutional norms, some user preferences may be inappropriate to apply. We introduce BenchPreS, which evaluates whether memory-based user preferences are appropriately applied or suppressed across communication contexts. Using two complementary metrics, Misapplication Rate (MR) and Appropriate Application Rate (AAR), we find even frontier LLMs struggle to apply preferences in a context-sensitive manner. Models with stronger preference adherence exhibit higher rates of over-application, and neither reasoning capability nor prompt-based defenses fully resolve this issue. These results suggest current LLMs treat personalized preferences as globally enforceable rules rather than as context-dependent normative signals.
[119] When AI Navigates the Fog of War cs.AI | cs.CL | cs.CYPDF
Ming Li, Xirui Li, Tianyi Zhou
TL;DR: 该论文通过构建一个时间锚定的案例研究,分析了大型语言模型在2026年中东冲突早期阶段进行地缘政治推理的能力。研究设计了11个关键时间节点和相应的可验证问题,以减轻训练数据泄露的影响,并首次对LLM在持续冲突中的推理进行了时间锚定分析。
Details
Motivation: 解决回顾性地缘政治预测中因训练数据泄露而导致的混淆问题,评估AI在信息不完全的“战争迷雾”环境下对正在展开的危机进行推理的能力。
Result: 研究发现:1. 最先进的LLM展现出显著的战略现实主义推理能力;2. 该能力在不同领域不均衡,模型在经济和后勤结构化场景中更可靠;3. 模型的叙事随时间演变。研究为未来无后见之偏见的分析提供了档案快照。
Insight: 创新点在于通过时间锚定的案例研究设计来隔离训练数据泄露的影响,首次对LLM在持续冲突中的动态推理进行实证分析,揭示了模型推理能力的领域不均衡性和时间演化特性。
Abstract: Can AI reason about a war before its trajectory becomes historically obvious? Analyzing this capability is difficult because retrospective geopolitical prediction is heavily confounded by training-data leakage. We address this challenge through a temporally grounded case study of the early stages of the 2026 Middle East conflict, which unfolded after the training cutoff of current frontier models. We construct 11 critical temporal nodes, 42 node-specific verifiable questions, and 5 general exploratory questions, requiring models to reason only from information that would have been publicly available at each moment. This design substantially mitigates training-data leakage concerns, creating a setting well-suited for studying how models analyze an unfolding crisis under the fog of war, and provides, to our knowledge, the first temporally grounded analysis of LLM reasoning in an ongoing geopolitical conflict. Our analysis reveals three main findings. First, current state-of-the-art large language models often display a striking degree of strategic realism, reasoning beyond surface rhetoric toward deeper structural incentives. Second, this capability is uneven across domains: models are more reliable in economically and logistically structured settings than in politically ambiguous multi-actor environments. Finally, model narratives evolve over time, shifting from early expectations of rapid containment toward more systemic accounts of regional entrenchment and attritional de-escalation. Since the conflict remains ongoing at the time of writing, this work can serve as an archival snapshot of model reasoning during an unfolding geopolitical crisis, enabling future studies without the hindsight bias of retrospective analysis.
[120] CritiSense: Critical Digital Literacy and Resilience Against Misinformation cs.AI | cs.CL | cs.CYPDF
Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori
TL;DR: 本文介绍了CritiSense,一款旨在通过互动挑战和即时反馈提升用户数字素养和抵御错误信息能力的多语言移动应用。该应用采用模块化设计,支持快速更新,并通过一项93名用户的可用性研究验证了其易用性和用户满意度。
Details
Motivation: 社交媒体上的错误信息损害知情决策和公众信任,而预先揭露(prebunking)作为一种主动策略,旨在帮助用户在遇到虚假信息前识别操纵手段。
Result: 可用性研究显示,83.9%的用户总体满意,90.1%认为应用易于使用。定性反馈表明应用有助于提升数字素养技能。上线3个多月已拥有300多名活跃用户。
Insight: 创新点在于构建了首个多语言(支持九种语言)、模块化的预先揭露平台,专为跨主题和领域的快速更新而设计,并可作为微学习对错误信息抵御力影响的测试平台。
Abstract: Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 3+ months, we have reached 300+ active users. It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en). Demo Video: https://shorturl.at/CDcdc
[121] IQuest-Coder-V1 Technical Report cs.AI | cs.CL | cs.SEPDF
Jian Yang, Wei Zhang, Shawn Guo, Zhengmao Ye, Lin Jing
TL;DR: 本文介绍了IQuest-Coder-V1系列代码大语言模型,提出了一种创新的代码流多阶段训练范式,通过预训练、中期训练和后训练三个阶段,动态捕捉软件逻辑的演变。该系列模型在代码智能的关键维度上实现了最先进的性能,并推出了IQuest-Coder-V1-Loop变体以优化模型容量与部署效率的权衡。
Details
Motivation: 为了解决传统静态代码表示的局限性,论文旨在开发能够动态理解软件逻辑演变的代码大语言模型,以提升在自主软件工程、竞技编程和复杂工具使用等任务上的代码智能。
Result: IQuest-Coder-V1在代码智能的关键维度(如自主软件工程、竞技编程和复杂工具使用)上实现了最先进的性能,超越了竞争模型。
Insight: 创新点包括:提出代码流多阶段训练范式,通过预训练、中期训练(整合推理和智能体轨迹)和后训练(分为思维路径和指令路径)动态捕捉软件逻辑;引入IQuest-Coder-V1-Loop变体的循环机制,以优化模型容量与部署效率的权衡;并公开了完整的白盒检查点链,促进自主代码智能研究。
Abstract: In this report, we introduce the IQuest-Coder-V1 series-(7B/14B/40B/40B-Loop), a new family of code large language models (LLMs). Moving beyond static code representations, we propose the code-flow multi-stage training paradigm, which captures the dynamic evolution of software logic through different phases of the pipeline. Our models are developed through the evolutionary pipeline, starting with the initial pre-training consisting of code facts, repository, and completion data. Following that, we implement a specialized mid-training stage that integrates reasoning and agentic trajectories in 32k-context and repository-scale in 128k-context to forge deep logical foundations. The models are then finalized with post-training of specialized coding capabilities, which is bifurcated into two specialized paths: the thinking path (utilizing reasoning-driven RL) and the instruct path (optimized for general assistance). IQuest-Coder-V1 achieves state-of-the-art performance among competitive models across critical dimensions of code intelligence: agentic software engineering, competitive programming, and complex tool use. To address deployment constraints, the IQuest-Coder-V1-Loop variant introduces a recurrent mechanism designed to optimize the trade-off between model capacity and deployment footprint, offering an architecturally enhanced path for efficacy-efficiency trade-off. We believe the release of the IQuest-Coder-V1 series, including the complete white-box chain of checkpoints from pre-training bases to the final thinking and instruction models, will advance research in autonomous code intelligence and real-world agentic systems.
[122] Prompt Programming for Cultural Bias and Alignment of Large Language Models cs.AI | cs.CLPDF
Maksim Eren, Eric Michalak, Brian Cook, Johnny Seales
TL;DR: 本文针对大语言模型(LLMs)中的文化偏见问题,提出了一种基于DSPy的提示编程方法,以系统优化文化条件提示,从而提升模型与目标人群的文化对齐度。
Details
Motivation: LLMs在战略决策和政策支持等应用中常表现出与文化目标群体不符的偏见,现有方法主要依赖手动提示工程且多针对闭源模型,因此需要一种更系统、可优化的方法来改善文化对齐。
Result: 实验表明,在开源LLMs上,通过DSPy进行提示优化通常能超越手动文化提示工程,为文化对齐响应提供了更稳定、可迁移的途径。
Insight: 创新点在于将提示视为模块化、可优化的程序,利用DSPy框架自动优化文化距离目标,从而系统性地调整文化条件,这为减少LLMs文化偏见提供了一种可编程的解决方案。
Abstract: Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language models (LLMs) often exhibit cultural biases that misalign with target populations. As LLMs are increasingly used for strategic decision-making, policy support, and document engineering tasks such as summarization, categorization, and compliance-oriented auditing, improving cultural alignment is important for ensuring that downstream analyses and recommendations reflect target-population value profiles rather than default model priors. Previous work introduced a survey-grounded cultural alignment framework and showed that culture-specific prompting can reduce misalignment, but it primarily evaluated proprietary models and relied on manual prompt engineering. In this paper, we validate and extend that framework by reproducing its social sciences survey based projection and distance metrics on open-weight LLMs, testing whether the same cultural skew and benefits of culture conditioning persist outside closed LLM systems. Building on this foundation, we introduce use of prompt programming with DSPy for this problem-treating prompts as modular, optimizable programs-to systematically tune cultural conditioning by optimizing against cultural-distance objectives. In our experiments, we show that prompt optimization often improves upon cultural prompt engineering, suggesting prompt compilation with DSPy can provide a more stable and transferable route to culturally aligned LLM responses.
[123] AsgardBench - Evaluating Visually Grounded Interactive Planning Under Minimal Feedback cs.AI | cs.CV | cs.ROPDF
Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao
TL;DR: AsgardBench是一个评估具身智能体在视觉引导下进行交互式规划能力的基准测试,专注于基于视觉观察的高层动作序列生成和计划适应,而非导航或低层操作。该基准包含108个任务实例,覆盖12种任务类型,通过系统变化对象状态、位置和场景配置来强调条件分支和计划修复。评估显示,领先的视觉语言模型在缺乏视觉输入时性能显著下降,揭示了视觉基础和状态跟踪的弱点。
Details
Motivation: 解决现有具身AI基准测试中交互式规划能力评估不足的问题,这些基准常将推理与导航混淆或提供过多纠正反馈,而AsgardBench旨在隔离交互式规划,通过限制输入为图像、动作历史和轻量级成功/失败信号,在受控模拟器中评估智能体基于视觉反馈调整计划的能力。
Result: 在AsgardBench基准上对领先视觉语言模型的评估表明,没有视觉输入时性能急剧下降,突显了视觉基础和状态跟踪的缺陷,从而削弱了交互式规划能力。该基准通过系统变化的任务实例,量化了模型在条件分支和计划修复场景中的表现,但未提及具体SOTA比较或定量结果水平。
Insight: 创新点在于设计了一个专注于交互式规划而非低层执行的基准测试,通过受控的任务变体强调条件分支和计划修复,从而更精确地评估智能体利用视觉观察适应计划的能力。从客观角度看,该研究提供了评估视觉基础交互规划的新方法,有助于揭示现有模型在动态环境适应中的局限性。
Abstract: With AsgardBench we aim to evaluate visually grounded, high-level action sequence generation and interactive planning, focusing specifically on plan adaptation during execution based on visual observations rather than navigation or low-level manipulation. In the landscape of embodied AI benchmarks, AsgardBench targets the capability category of interactive planning, which is more sophisticated than offline high-level planning as it requires agents to revise plans in response to environmental feedback, yet remains distinct from low-level execution. Unlike prior embodied AI benchmarks that conflate reasoning with navigation or provide rich corrective feedback that substitutes for perception, AsgardBench restricts agent input to images, action history, and lightweight success/failure signals, isolating interactive planning in a controlled simulator without low-level control noise. The benchmark contains 108 task instances spanning 12 task types, each systematically varied through object state, placement, and scene configuration. These controlled variations create conditional branches in which a single instruction can require different action sequences depending on what the agent observes, emphasizing conditional branching and plan repair during execution. Our evaluations of leading vision language models show that performance drops sharply without visual input, revealing weaknesses in visual grounding and state tracking that ultimately undermine interactive planning. Our benchmark zeroes in on a narrower question: can a model actually use what it sees to adapt a plan when things do not go as expected?
cs.SD [Back]
[124] Diffusion Models for Joint Audio-Video Generation cs.SD | cs.AI | cs.CV | cs.MMPDF
Alejandro Paredes La Torre
TL;DR: 本文针对联合音视频生成的挑战,提出了四个关键贡献:发布了两个高质量配对音视频数据集(13小时游戏剪辑和64小时音乐会片段),从头训练MM-Diffusion架构以生成语义一致的音视频对,探索了联合潜在扩散的挑战,并提出了一个两步式文本到音视频生成流程。
Details
Motivation: 解决多模态生成模型中真正联合音视频生成的开放性问题,以提升音视频合成的语义一致性和时间同步性。
Result: 实验表明,提出的两步式生成流程能够产生高保真度的音视频生成,并在快速动作和音乐提示上定量评估了对齐效果。
Insight: 创新点包括发布高质量配对数据集、验证MM-Diffusion架构的联合生成能力,以及提出模块化的两步生成流程以解决多模态解码不一致问题,为音视频联合生成提供了可复现的研究基准和实用方法。
Abstract: Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I explore four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consisting on 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on our datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity generations of audio video generation.
cs.AR [Back]
[125] GLANCE: Gaze-Led Attention Network for Compressed Edge-inference cs.AR | cs.CV | eess.IVPDF
Neeraj Solanki, Hong Ding, Sepehr Tabrizchi, Ali Shafiee Sarvestani, Shaahin Angizi
TL;DR: 本文提出了一种名为GLANCE的两阶段视觉处理系统,旨在解决AR/VR系统中实时目标检测面临的计算和功耗限制。该系统受生物中央凹视觉启发,首先使用可微分的无权重神经网络进行超高效的眼动追踪,然后利用注意力引导的感兴趣区域进行目标检测。通过在Arduino Nano 33 BLE上部署,系统在COCO数据集上实现了48.1%的mAP,同时保持了亚10毫秒的延迟,显著降低了计算负载和能耗。
Details
Motivation: 解决AR/VR系统中实时目标检测在严格功耗预算下实现亚10毫秒低延迟的关键计算约束问题。
Result: 在COCO数据集上,系统整体mAP达到48.1%,在注意力区域内的目标检测mAP为51.8%。与全局处理的YOLOv12n基线相比,基于ROI的方法在小型、中型和大型物体上的准确率分别从39.2%、63.4%、83.1%提升至51.3%、72.1%、88.1%。计算负担减少40-50%,能耗降低65%,在Arduino Nano 33 BLE上实现亚10毫秒延迟,通信时间提升177倍。
Insight: 创新点在于结合生物启发的注意力机制,采用基于内存查找而非乘加运算的超低功耗眼动追踪,以及注意力引导的ROI检测来选择性处理图像区域。从客观角度看,其核心创新是提出了一个以内存为中心、显式建模注意力的架构,为资源受限的可穿戴平台在效率和精度上提供了优于均匀处理的解决方案。
Abstract: Real-time object detection in AR/VR systems faces critical computational constraints, requiring sub-10,ms latency within tight power budgets. Inspired by biological foveal vision, we propose a two-stage pipeline that combines differentiable weightless neural networks for ultra-efficient gaze estimation with attention-guided region-of-interest object detection. Our approach eliminates arithmetic-intensive operations by performing gaze tracking through memory lookups rather than multiply-accumulate computations, achieving an angular error of $8.32^{\circ}$ with only 393 MACs and 2.2 KiB of memory per frame. Gaze predictions guide selective object detection on attended regions, reducing computational burden by 40-50% and energy consumption by 65%. Deployed on the Arduino Nano 33 BLE, our system achieves 48.1% mAP on COCO (51.8% on attended objects) while maintaining sub-10,ms latency, meeting stringent AR/VR requirements by improving the communication time by $\times 177$. Compared to the global YOLOv12n baseline, which achieves 39.2%, 63.4%, and 83.1% accuracy for small, MEDium, and LARGE objects, respectively, the ROI-based method yields 51.3%, 72.1%, and 88.1% under the same settings. This work shows that memory-centric architectures with explicit attention modeling offer better efficiency and accuracy for resource-constrained wearable platforms than uniform processing.
cs.CR [Back]
[126] Do Not Leave a Gap: Hallucination-Free Object Concealment in Vision-Language Models cs.CR | cs.CVPDF
Amira Guesmi, Muhammad Shafique
TL;DR: 本文提出了一种针对视觉语言模型(VLMs)的新型背景一致物体隐藏攻击方法,旨在通过重新编码目标物体的视觉表示,使其在统计和语义上与周围背景区域一致,从而避免传统基于注意力抑制的攻击方法所导致的语义不连续和幻觉问题。
Details
Motivation: 现有基于抑制或遮挡区域特定表示的物体隐藏攻击方法会引入语义间隙,导致模型产生幻觉(即虚构出合理但不存在的物体),本文旨在解决这一问题,实现无幻觉的物体隐藏。
Result: 在先进的视觉语言模型上进行的大量实验表明,该方法能有效隐藏目标物体,同时保留高达86%的非目标物体,并将基于grounded的幻觉相比基于注意力抑制的攻击减少高达3倍。
Insight: 创新点在于揭示了幻觉并非源于物体本身的缺失,而是源于基于抑制的攻击引入的语义不连续性,并提出了通过背景一致的重编码来保持令牌结构和注意力流,从而避免触发幻觉的优化框架。
Abstract: Vision-language models (VLMs) have recently shown remarkable capabilities in visual understanding and generation, but remain vulnerable to adversarial manipulations of visual content. Prior object-hiding attacks primarily rely on suppressing or blocking region-specific representations, often creating semantic gaps that inadvertently induce hallucination, where models invent plausible but incorrect objects. In this work, we demonstrate that hallucination arises not from object absence per se, but from semantic discontinuity introduced by such suppression-based attacks. We propose a new class of \emph{background-consistent object concealment} attacks, which hide target objects by re-encoding their visual representations to be statistically and semantically consistent with surrounding background regions. Crucially, our approach preserves token structure and attention flow, avoiding representational voids that trigger hallucination. We present a pixel-level optimization framework that enforces background-consistent re-encoding across multiple transformer layers while preserving global scene semantics. Extensive experiments on state-of-the-art vision-language models show that our method effectively conceals target objects while preserving up to $86%$ of non-target objects and reducing grounded hallucination by up to $3\times$ compared to attention-suppression-based attacks.
astro-ph.IM [Back]
[127] LenghuSky-8: An 8-Year All-Sky Cloud Dataset with Star-Aware Masks and Alt-Az Calibration for Segmentation and Nowcasting astro-ph.IM | cs.AI | cs.CVPDF
Yicheng Rui, Xiao-Wei Duan, Licai Deng, Fan Yang, Zhengming Dang
TL;DR: 本文介绍了LenghuSky-8,一个来自顶级天文台址的八年(2018-2025)全天空成像数据集,包含429,620帧图像、星体感知云掩膜、背景掩膜和像素级高度-方位角(Alt-Az)校准。论文基于DINOv3局部特征训练线性探针进行鲁棒的云分割,并在手动标注集上达到93.3%的整体准确率。此外,论文还引入了基于像素级三类对数概率的短时临近预报基准,并发布了数据集、校准工具和开源工具包。
Details
Motivation: 现有全天空数据集存在时间短、偏向白天或缺乏天体测量校准的问题,而地基时域天文台需要分钟级、站点尺度的云层覆盖感知,因此需要构建一个长期、覆盖夜间、且经过校准的数据集以支持分割和临近预报研究。
Result: 在1,111张手动标注的平衡图像集上,基于DINOv3特征的线性探针分割模型达到93.3% ± 1.1%的整体准确率。校准不确定性在天顶约为0.37度,在30度高度角约为1.34度。在短时临近预报基准测试中,ConvLSTM表现最佳,但相比持续性基线(复制最后一帧)仅获得有限提升。
Insight: 创新点包括:1) 发布了首个长期(八年)、高夜间覆盖率的全天空数据集,并提供了星体感知掩膜和像素级Alt-Az校准;2) 利用DINOv3预训练特征进行鲁棒的云分割,有效处理了昼夜和月相变化;3) 引入了基于像素级分类概率的临近预报基准,并评估了多种基线模型,揭示了短期云演变的预测难度;4) 提供了可直接与望远镜调度系统集成的校准数据和工具包,推动了自主天文台操作的研究。
Abstract: Ground-based time-domain observatories require minute-by-minute, site-scale awareness of cloud cover, yet existing all-sky datasets are short, daylight-biased, or lack astrometric calibration. We present LenghuSky-8, an eight-year (2018-2025) all-sky imaging dataset from a premier astronomical site, comprising 429,620 $512 \times 512$ frames with 81.2% night-time coverage, star-aware cloud masks, background masks, and per-pixel altitude-azimuth (Alt-Az) calibration. For robust cloud segmentation across day, night, and lunar phases, we train a linear probe on DINOv3 local features and obtain 93.3% $\pm$ 1.1% overall accuracy on a balanced, manually labeled set of 1,111 images. Using stellar astrometry, we map each pixel to local alt-az coordinates and measure calibration uncertainties of approximately 0.37 deg at zenith and approximately 1.34 deg at 30 deg altitude, sufficient for integration with telescope schedulers. Beyond segmentation, we introduce a short-horizon nowcasting benchmark over per-pixel three-class logits (sky/cloud/contamination) with four baselines: persistence (copying the last frame), optical flow, ConvLSTM, and VideoGPT. ConvLSTM performs best but yields only limited gains over persistence, underscoring the difficulty of near-term cloud evolution. We release the dataset, calibrations, and an open-source toolkit for loading, evaluation, and scheduler-ready alt-az maps to boost research in segmentation, nowcasting, and autonomous observatory operations.
cs.MM [Back]
[128] Visual Set Program Synthesizer cs.MM | cs.CL | cs.SCPDF
Zehua Cheng, Wei Dai, Wenhu Zhang, Thomas Lukasiewicz, Jiahao Sun
TL;DR: 本文提出视觉集程序合成器,将视觉推理视为程序合成问题,通过生成符号化程序并由独立引擎执行来解决超市货架等场景中的复杂集合推理任务。
Details
Motivation: 当前视觉AI助手难以处理需要集合推理(如过滤、比较和聚合)的复杂查询,因为标准端到端MLLM缺乏组合逻辑的显式机制。
Result: 在专门设计的Set-VQA基准测试中,该方法显著优于最先进的基线模型,在复杂推理任务上大幅提升答案准确性。
Insight: 创新点在于将视觉推理分解为程序生成与执行两阶段,提供可解释、系统化的推理过程,为黑盒视觉语言推理提供了原则性替代方案。
Abstract: A user pointing their phone at a supermarket shelf and asking “Which soda has the least sugar?” poses a difficult challenge for current visual Al assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard endto-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.
[129] DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression cs.MM | cs.AI | cs.CV | cs.SDPDF
Bingzhou Li, Tao Huang
TL;DR: 本文提出了DASH(动态音频驱动的语义分块)框架,用于高效压缩全模态大语言模型(OmniLLMs)中的音频和视觉令牌序列。该方法利用音频嵌入作为语义锚点,通过余弦相似度不连续性检测边界候选,形成动态变长分段,并融合边界线索、表征独特性和注意力显著性进行令牌保留,从而在保持高精度的同时实现更高的压缩比。
Details
Motivation: 全模态大语言模型联合处理音频和视觉流时,产生的长多模态令牌序列导致推理成本极高;现有压缩方法依赖固定窗口划分和基于注意力的剪枝,忽略了视听信号的片段化语义结构,在激进令牌削减下变得脆弱。
Result: 在AVUT、VideoMME和WorldSense基准测试上的大量实验表明,DASH在保持更高精度的同时,相比先前方法实现了更高的压缩比。
Insight: 创新点包括:以音频嵌入为语义锚点进行动态语义分块,将令牌压缩与语义结构对齐;提出三信号重要性估计器(融合边界线索、表征独特性和注意力显著性)来缓解纯注意力选择的稀疏性偏差;通过跨模态投影建立显式的分段,实现结构感知的令牌分配。
Abstract: Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods. Code is available at: https://github.com/laychou666/DASH.
cs.LG [Back]
[130] Alternating Reinforcement Learning with Contextual Rubric Rewards cs.LG | cs.AI | cs.CLPDF
Guangchen Lan
TL;DR: 本文提出了交替强化学习与上下文规则奖励(ARL-RR)框架,以克服现有基于规则奖励的强化学习(RLRR)中固定权重线性压缩向量奖励为标量奖励的局限性。该方法通过交替优化单个语义规则元类,并引入基于搜索的轻量级自适应过程动态选择下一个元类,从而提升模型性能与训练效率。理论分析表明奖励聚合具有方差收缩效应,实验在HealthBench数据集上验证了ARL-RR在不同模型规模下均优于标量化方法。
Details
Motivation: 现有RLRR方法将向量奖励线性压缩为标量奖励时依赖固定权重,对人工评分设计敏感且无法捕捉奖励维度间的相关性,因此需要一种更灵活的奖励聚合机制。
Result: 在HealthBench数据集上,ARL-RR在不同模型规模(1.7B、4B、8B、14B)下均一致优于标量化方法,在模型性能和训练效率方面均表现出提升。
Insight: 创新点在于通过交替优化单个规则元类避免固定标量化,并结合动态元类选择机制强调关键目标;理论贡献是揭示了奖励聚合的方差收缩效应,为性能提升提供了解释。
Abstract: Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, multi-dimensional, contextual rubric-based evaluations. However, existing approaches in RLRR are limited to linearly compressing vector rewards into a scalar reward with a fixed weightings, which is sensitive to artificial score design and fails to capture correlations among reward dimensions. To overcome the limitations of reward aggregation, this work proposes Alternating Reinforcement Learning with Rubric Rewards (ARL-RR), a framework that eliminates the need for a fixed scalarization by optimizing one semantic rubric meta-class at a time. Theoretically, we show that reward aggregation induces a variance contraction effect, which helps explain the performance gains. We further introduce a lightweight, search-based adaptation procedure that selects the next meta-class dynamically based on task performance, enabling the policy to emphasize critical objectives and thereby improve the model performance. Empirically, our experiments on the HealthBench dataset with experts annotations demonstrate that ARL-RR uniformly outperforms scalarized methods in both model performance and training efficiency across different model scales (1.7B, 4B, 8B, and 14B).
[131] Beyond Reward Suppression: Reshaping Steganographic Communication Protocols in MARL via Dynamic Representational Circuit Breaking cs.LG | cs.AI | cs.CL | cs.IT | cs.MAPDF
Liu Hung Ming
TL;DR: 本文提出了一种名为动态表征电路断路器(DRCB)的架构防御方法,用于检测和抑制去中心化多智能体强化学习(MARL)中智能体通过隐写术共谋(即发展私有协议以规避监控)的行为。该方法在优化底层操作,利用VQ-VAE瓶颈将不可观测消息转换为可审计的统计对象,并通过监控多个信号计算共谋分数,触发分级干预措施。实验表明,DRCB能显著提升监控准确性并降低波动性,同时保持联合奖励,有效抑制了复杂隐写编码。
Details
Motivation: 现有防御方法局限于行为或奖励层面,无法检测潜在通信通道中的协调,而智能体隐写共谋是AI安全的关键威胁,因此需要一种在优化底层操作、能检测和重塑隐写通信协议的防御机制。
Result: 在基于MNIST标签的情境囚徒困境实验中,静态监控失败(p=0.3517),而DRCB将观察者平均准确率从0.858提升至0.938(+9.3%),波动性降低43%,同时保持平均联合奖励(p=0.854)。分析214,298个符号样本确认了“语义退化”现象,高频序列熵收敛为零,阻止了复杂隐写编码。
Insight: 创新点在于将防御从行为/奖励层移至优化底层,通过VQ-VAE瓶颈和动态监控信号(如Jensen-Shannon散度漂移)实现可审计的隐写检测,并引入分级干预策略(如梯度空间惩罚、奖励抑制和电路重置)。客观分析认为,该方法提供了任务无关的、符合MICA审计标准的预部署审计技术路径,并揭示了“透明度悖论”现象,即智能体在表面确定性下仍保留长尾分布中的残余能力,反映了古德哈特定律。
Abstract: In decentralized Multi-Agent Reinforcement Learning (MARL), steganographic collusion – where agents develop private protocols to evade monitoring – presents a critical AI safety threat. Existing defenses, limited to behavioral or reward layers, fail to detect coordination in latent communication channels. We introduce the Dynamic Representational Circuit Breaker (DRCB), an architectural defense operating at the optimization substrate. Building on the AI Mother Tongue (AIM) framework, DRCB utilizes a Vector Quantized Variational Autoencoder (VQ-VAE) bottleneck to convert unobservable messages into auditable statistical objects. DRCB monitors signals including Jensen-Shannon Divergence drift, L2-norm codebook displacement, and Randomized Observer Pool accuracy to compute an EMA-based Collusion Score. Threshold breaches trigger four escalating interventions: dynamic adaptation, gradient-space penalty injection into the Advantage function A^pi, temporal reward suppression, and full substrate circuit breaking via codebook shuffling and optimizer state reset. Experiments on a Contextual Prisoner’s Dilemma with MNIST labels show that while static monitoring fails (p = 0.3517), DRCB improves observer mean accuracy from 0.858 to 0.938 (+9.3 percent) and reduces volatility by 43 percent, while preserving mean joint reward (p = 0.854). Analysis of 214,298 symbol samples confirms “Semantic Degradation,” where high-frequency sequences converge to zero entropy, foreclosing complex steganographic encodings. We identify a “Transparency Paradox” where agents achieve surface-level determinism while preserving residual capacity in long-tail distributions, reflecting Goodhart’s Law. This task-agnostic methodology provides a technical path toward MICA-compliant (Multi-Agent Internal Coupling Audit) pre-deployment auditing for autonomous systems.
[132] HIPO: Instruction Hierarchy via Constrained Reinforcement Learning cs.LG | cs.AI | cs.CLPDF
Keru Chen, Jun Luo, Sen Lin, Yingbin Liang, Alvaro Velasquez
TL;DR: 本文提出了HIPO框架,通过约束强化学习解决分层指令遵循问题,将系统提示作为算法约束,在保证遵循系统指令的前提下最大化用户效用。
Details
Motivation: 现有方法如RLHF和DPO主要优化单一目标,无法显式确保系统提示的遵循,而监督微调无法在算法层面建立优先级不对称性,因此需要新方法解决分层指令遵循问题。
Result: 在多种模型架构上的广泛评估表明,HIPO显著提高了系统遵循性和用户效用。
Insight: 将分层指令遵循建模为约束马尔可夫决策过程,并使用对偶安全强化学习方法动态强制执行系统提示遵循作为显式约束,同时机制分析表明该方法能自主引导模型关注长程系统标记。
Abstract: Hierarchical Instruction Following (HIF) refers to the problem of prompting large language models with a priority-ordered stack of instructions. Standard methods like RLHF and DPO typically fail in this problem since they mainly optimize for a single objective, failing to explicitly enforce system prompt compliance. Meanwhile, supervised fine-tuning relies on mimicking filtered, compliant data, which fails to establish the priority asymmetry at the algorithmic level. In this paper, we introduce \textsc{HIPO}, a novel alignment framework that formulates HIF as a Constrained Markov Decision Process. \textsc{HIPO} elevates system prompts from mere input context to strict algorithmic boundaries. Using a primal-dual safe reinforcement learning approach, the algorithm dynamically enforces system prompt compliance as an explicit constraint, maximizing user utility strictly within this feasible region. Extensive evaluations across diverse model architectures (e.g., Qwen, Phi, Llama) demonstrate that \textsc{HIPO} significantly improves both system compliance and user utility. Furthermore, mechanistic analysis reveals that this constrained optimization autonomously drives the model to shift its attention toward long-range system tokens, providing a principled foundation for reliable LLM deployment in complex workflows.
[133] Offline Exploration-Aware Fine-Tuning for Long-Chain Mathematical Reasoning cs.LG | cs.CLPDF
Yongyu Mu, Jiali Zeng, Fandong Meng, JingBo Zhu, Tong Xiao
TL;DR: 该论文提出了离线探索感知微调(OXA)方法,旨在提升大型语言模型在长链数学推理任务中的性能。该方法通过优化两个目标来增强监督微调阶段:一是利用低置信度但已验证的教师蒸馏数据来内化未掌握的推理模式,二是抑制高置信度但错误的自我蒸馏数据以重新分配错误模式的概率质量。实验表明,OXA在多个基准测试上持续提升推理性能,尤其在Qwen2.5-1.5B-Math模型上相比传统SFT平均提升Pass@1和Pass@k分数,并增加初始策略熵,其增益在后续强化学习训练中得以保持。
Details
Motivation: 现有研究主要关注在基于可验证奖励的强化学习训练中促进探索,而忽视了在监督微调阶段进行探索感知的优化,这限制了模型初始化的质量,从而影响后续探索效果。论文旨在填补这一空白,通过改进SFT来更好地初始化模型,为后续RLVR训练奠定基础。
Result: 在6个基准测试上的实验结果显示,OXA一致地提升了数学推理性能,特别是在Qwen2.5-1.5B-Math模型上,相比传统SFT平均获得+6 Pass@1和+5 Pass@k点的增益。此外,OXA提高了初始策略熵,且性能提升在广泛的RLVR训练中持续存在。
Insight: 创新点在于将探索感知机制引入监督微调阶段,通过双重目标优化(促进低置信度正确数据、抑制高置信度错误数据)来改善模型初始化,这不仅能直接提升SFT性能,还能为后续强化学习提供更优的探索起点,具有长期价值。从客观角度看,该方法强调了离线阶段数据质量与置信度校准的重要性,为链式推理任务的微调策略提供了新思路。
Abstract: Through encouraging self-exploration, reinforcement learning from verifiable rewards (RLVR) has significantly advanced the mathematical reasoning capabilities of large language models. As the starting point for RLVR, the capacity of supervised fine-tuning (SFT) to memorize new chain-of-thought trajectories provides a crucial initialization that shapes the subsequent exploration landscape. However, existing research primarily focuses on facilitating exploration during RLVR training, leaving exploration-aware SFT under-explored. To bridge this gap, we propose Offline eXploration-Aware (OXA) fine-tuning. Specifically, OXA optimizes two objectives: promoting low-confidence verified teacher-distillation data to internalize previously uncaptured reasoning patterns, and suppressing high-confidence incorrect self-distillation data to redistribute probability mass of incorrect patterns toward potentially correct candidates. Experimental results across 6 benchmarks show that OXA consistently improves mathematical reasoning performance, especially achieving an average gain of $+6$ Pass@1 and $+5$ Pass@$k$ points compared to conventional SFT on the Qwen2.5-1.5B-Math. Crucially, OXA elevates initial policy entropy, and performance gains persist throughout extensive RLVR training, demonstrating the long-term value of OXA.
[134] Capability-Guided Compression: Toward Interpretability-Aware Budget Allocation for Large Language Models cs.LG | cs.CLPDF
Rishaank Gupta
TL;DR: 本文提出了能力引导压缩(CGC)框架,通过稀疏自编码器(SAE)生成的能力密度图来为Transformer模型的不同组件分配差异化的压缩预算,旨在解决现有压缩方法因忽视组件功能编码而导致的‘能力盲压缩’问题。
Details
Motivation: 现有的大语言模型压缩方法(如剪枝、量化、低秩分解)在分配压缩预算时,均未考虑各个模型组件具体编码了何种功能,这导致了两个已知问题:基于困惑度的评估对推理能力损失不敏感,以及模型性能在压缩时会出现突然的相位转变。
Result: 在GPT-2 Medium上的实验表明,所提出的能力密度指标与现有的Wanda重要性评分在统计上独立(Spearman rho = -0.054),证明它是一个全新的、正交于现有指标的压缩信号。论文也报告了基于困惑度比较的负面结果,并诊断GPT-2 Medium不足以完全验证CGC假设。
Insight: 核心创新点在于首次将模型组件的功能编码能力(通过SAE特征激活分布的广度、熵和跨输入一致性来形式化定义)作为压缩预算分配的依据,并提供了理论证明,为能力感知的压缩研究奠定了基础。
Abstract: Large language model compression has made substantial progress through pruning, quantization, and low-rank decomposition, yet a fundamental limitation persists across all existing methods: compression budgets are allocated without any representation of what individual model components functionally encode. We term this the capability-blind compression problem and argue it is a root cause of two well-documented failures – the insensitivity of perplexity-based evaluation to reasoning capability loss, and the abrupt phase transitions in model performance recently characterized by Ma et al. (2026). We propose Capability-Guided Compression (CGC), a framework that addresses this by using Sparse Autoencoder (SAE)-derived capability density maps to allocate differential compression budgets across transformer components. Capability density is a formally defined scalar measure combining the feature breadth, activation entropy, and cross-input consistency of a component’s SAE feature activation distribution. We prove theoretically that components with higher capability density exhibit lower structural redundancy and reach their individual phase transition points at lower compression ratios, providing the first pre-compression mechanism for component-level phase transition prediction. Experiments on GPT-2 Medium confirm that capability density is statistically independent of Wanda importance scores (Spearman rho = -0.054, n = 384 heads), establishing it as a genuinely novel compression signal orthogonal to all existing importance metrics. We report a negative result on PPL-based compression comparison and provide a principled diagnosis identifying GPT-2 Medium as an insufficient test bed for the full CGC hypothesis. The theoretical framework, density formalism, and orthogonality finding constitute a foundation for capability-aware compression research.
[135] From the Inside Out: Progressive Distribution Refinement for Confidence Calibration cs.LG | cs.CLPDF
Xizhong Yang, Yinan Xia, Huiming Wang, Mofei Song
TL;DR: 本文提出DistriTTRL方法,通过利用强化学习中模型置信度的分布先验来渐进优化奖励信号,并引入多样性惩罚来缓解基于投票的测试时训练策略导致的奖励黑客问题,从而在多个模型和基准测试上取得了显著性能提升。
Details
Motivation: 现有工作虽在测试时缩放策略应用于强化学习方面取得进展,但未能充分解决测试与训练时内部信息差异的问题,且基于投票的测试时训练策略常受奖励黑客问题困扰。
Result: DistriTTRL在多个模型和基准测试上实现了显著的性能改进,具体定量结果未在摘要中提及,但暗示达到了先进水平。
Insight: 创新点在于利用模型置信度分布先验进行渐进式奖励优化,而非依赖单次查询展开,并通过多样性惩罚机制缓解奖励黑客,使模型能力与自奖励信号互补。
Abstract: Leveraging the model’s internal information as the self-reward signal in Reinforcement Learning (RL) has received extensive attention due to its label-free nature. While prior works have made significant progress in applying the Test-Time Scaling (TTS) strategies to RL, the discrepancy in internal information between test and training remains inadequately addressed. Moreover, Test-Time Training based on voting-based TTS strategies often suffers from reward hacking problems. To address these issues, we propose DistriTTRL, which leverages the distribution prior of the model’s confidence during RL to progressively optimize the reward signal, rather than relying solely on single-query rollouts. Additionally, we mitigate the phenomenon of consistent reward hacking caused by the voting-based TTS strategies through diversity-targeted penalties. Benefiting from this training mechanism where model capability and self-reward signals complement each other, and the mitigation of reward hacking, DistriTTRL has achieved significant performance improvements across multiple models and benchmarks.
[136] When and Why Does Unsupervised RL Succeed in Mathematical Reasoning? A Manifold Envelopment Perspective cs.LG | cs.CLPDF
Zelin Zhang, Fei Cheng, Chenhui Chu
TL;DR: 本文探讨了无监督强化学习在数学推理任务中何时及为何成功。通过设计一套强调简洁性和确定性的内在奖励机制,测试不同基础模型的逻辑先验能力,并引入几何诊断视角揭示成功案例被流形包裹的特性,论文不仅证明了无监督方法能提升数学推理性能,还阐明了其失效条件及几何原因。
Details
Motivation: 解决基于结果的强化学习依赖昂贵标注数据导致的扩展性瓶颈,以及无监督强化学习中训练动态不透明和灾难性不稳定性(如策略崩溃和奖励黑客)的问题。
Result: 通过内在奖励机制在数学推理任务上提升性能,但具体基准和定量结果未在摘要中明确提及;揭示了模型逻辑先验如何决定成功或失败,并几何诊断了稳定与崩溃的配置差异。
Insight: 创新点包括设计强调简洁和确定性的内在奖励套件,从逻辑先验角度探索无监督方法的边界,以及引入几何诊断视角(流形包裹)来解释训练稳定性;客观分析认为,将无监督RL的成功条件与几何结构关联提供了新的理论洞察。
Abstract: Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite of intrinsic rewards that explicitly enforce concise and certain generation. Second, to discover the boundaries of this approach, we test base models across a spectrum of intrinsic reasoning capabilities, revealing how a model’s foundational logical prior dictates its success or failure. Finally, to demystify why certain configurations stabilize while others collapse, we introduce a novel geometric diagnostic lens, showing that successful cases are enveloped by manifolds. Ultimately, our work goes beyond merely demonstrating that enforcing concise and certain responses successfully boosts mathematical reasoning; we reveal when this unsupervised approach breaks down and geometrically diagnose why.
[137] Efficient Reasoning on the Edge cs.LG | cs.CLPDF
Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli
TL;DR: 本文提出了一种在资源受限的边缘设备上实现高效推理的方法,通过结合LoRA适配器与监督微调,并引入基于强化学习的预算强制机制来缩短响应长度,同时利用并行测试时缩放和动态适配器切换机制,以及KV-cache共享策略,以在移动设备上实现准确且高效的LLM推理。
Details
Motivation: 解决大型语言模型在边缘部署时因冗长推理轨迹、大上下文需求导致的高令牌生成成本、大KV-cache占用以及将推理能力蒸馏到小模型效率低下的问题。
Result: 在Qwen2.5-7B模型上的实验表明,该方法在严格资源约束下实现了高效准确的推理,适合移动场景。
Insight: 创新点包括:结合LoRA适配器与监督微调实现轻量级推理;通过强化学习预算强制减少响应长度;利用并行测试时缩放提升精度;动态适配器切换按需激活推理;KV-cache共享策略降低首次令牌生成时间。这些技术协同优化了边缘设备上的推理效率与准确性。
Abstract: Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models, which are verbose and stylistically redundant, undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
[138] How to Achieve Prototypical Birth and Death for OOD Detection? cs.LG | cs.CVPDF
Ningkang Peng, Qianfeng Yu, Xiaoqian Peng, Linjing Qian, Yafei Liu
TL;DR: 本文提出了一种名为PID(原型生灭)的动态原型学习方法,用于改进分布外(OOD)检测。该方法受生物学中细胞生灭过程的启发,通过原型出生和原型死亡两种机制,在训练过程中根据数据复杂度自适应地调整原型数量,从而学习到更紧凑、分离度更好的分布内(ID)嵌入表示,显著提升了OOD样本的检测能力。
Details
Motivation: 现有基于原型的OOD检测方法通常依赖固定数量的原型,这种静态假设无法适应不同类别固有的复杂度差异,缺乏根据数据复杂度自适应调整原型数量的机制。
Result: 实验表明,所提出的动态方法PID在CIFAR-100等基准测试上显著优于现有方法,尤其是在FPR95指标上达到了最先进的(SOTA)性能。
Insight: 核心创新点在于引入了受生物学启发的动态原型生灭机制:出生机制通过在表征不足的数据区域实例化新原型来精细捕捉类内子结构;死亡机制通过评估原型的可区分性来修剪边界模糊的原型以强化决策边界。这实现了根据数据复杂度动态调整原型数量,从而优化了嵌入空间。从客观角度看,这是一种将动态结构学习与原型表示相结合的创新思路,为解决静态原型假设的局限性提供了新途径。
Abstract: Out-of-Distribution (OOD) detection is crucial for the secure deployment of machine learning models, and prototype-based learning methods are among the mainstream strategies for achieving OOD detection. Existing prototype-based learning methods generally rely on a fixed number of prototypes. This static assumption fails to adapt to the inherent complexity differences across various categories. Currently, there is still a lack of a mechanism that can adaptively adjust the number of prototypes based on data complexity. Inspired by the processes of cell birth and death in biology, we propose a novel method named PID (Prototype bIrth and Death) to adaptively adjust the prototype count based on data complexity. This method relies on two dynamic mechanisms during the training process: prototype birth and prototype death. The birth mechanism instantiates new prototypes in data regions with insufficient representation by identifying the overload level of existing prototypes, thereby meticulously capturing intra-class substructures. Conversely, the death mechanism reinforces the decision boundary by pruning prototypes with ambiguous class boundaries through evaluating their discriminability. Through birth and death, the number of prototypes can be dynamically adjusted according to the data complexity, leading to the learning of more compact and better-separated In-Distribution (ID) embeddings, which significantly enhances the capability to detect OOD samples. Experiments demonstrate that our dynamic method, PID, significantly outperforms existing methods on benchmarks such as CIFAR-100, achieving State-of-the-Art (SOTA) performance, especially on the FPR95 metric.
[139] Discovering the Hidden Role of Gini Index In Prompt-based Classification cs.LG | cs.AI | cs.CVPDF
Ruixi Lin
TL;DR: 本文揭示了基尼指数在基于提示的分类任务中作为检测和优化类别准确率不平衡的隐藏作用,提出了一种模型无关的后处理偏置缓解方法,并在少样本新闻、生物医学和零样本图像分类任务中验证了其有效性。
Details
Motivation: 解决长尾分布中少数类别准确率低、少数高表现类别主导预测的问题,探索基尼指数在基于提示的分类中作为衡量和优化类别准确率不平衡的工具。
Result: 在少样本新闻、生物医学和零样本图像分类实验中,该方法显著减少了相对和绝对准确率不平衡,最小化了顶部类别的相对主导地位,同时提升了最弱类别的表现。
Insight: 创新点在于将基尼指数引入为衡量类别准确率不平衡的指标,并直接作为优化目标;提出了一种模型无关的后处理偏置缓解方法,适用于文本和图像分类任务,可有效提升长尾分布下的分类公平性。
Abstract: In classification tasks, the long-tailed minority classes usually offer the predictions that are most important. Yet these classes consistently exhibit low accuracies, whereas a few high-performing classes dominate the game. We pursue a foundational understanding of the hidden role of Gini Index as a tool for detecting and optimizing (debiasing) disparities in class accuracy, focusing on the case of prompt-based classification. We introduce the intuitions, benchmark Gini scores in real-world LLMs and vision models, and thoroughly discuss the insights of Gini not only as a measure of relative accuracy dominance but also as a direct optimization metric. Through rigorous case analyses, we first show that weak to strong relative accuracy imbalance exists in both prompt-based, text and image classification results and regardless of whether the classification is high-dimensional or low-dimensional. Then, we harness the Gini metric to propose a post-hoc model-agnostic bias mitigation method. Experimental results across few-shot news, biomedical, and zero-shot image classification show that our method significantly reduces both relative and absolute accuracy imbalances, minimizing top class relative dominance while elevating weakest classes.
[140] Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition cs.LG | cs.AI | cs.CVPDF
Xiaozhou Ye, Feng Jiang, Zihan Wang, Xiulai Wang, Yutao Zhang
TL;DR: 本文提出了一种名为CTFG(协作时序特征生成)的新框架,用于解决基于传感器的跨用户活动识别中的泛化问题。该方法将可泛化的特征提取建模为一个由强化学习驱动的协作序列生成过程,使用基于Transformer的自回归生成器逐步构建特征标记序列,并通过无评论家的组相对策略优化算法进行优化,结合分类判别、跨用户不变性和时序保真度的三目标奖励来塑造特征空间。
Details
Motivation: 动机在于解决基于可穿戴惯性传感器的人类活动识别在跨用户部署时,因生理特征、运动习惯和传感器放置等异质性导致的性能下降问题,现有领域泛化方法要么忽略了传感器流中的时序依赖,要么依赖于不切实际的目标域标注。
Result: 在DSADS和PAMAP2基准测试上,该方法实现了最先进的跨用户准确率(分别为88.53%和75.22%),显著降低了任务间训练方差,加速了收敛,并在不同动作空间维度下表现出鲁棒的泛化能力。
Insight: 创新点在于将特征提取重新定义为协作序列生成问题,并引入了无评论家的组相对策略优化算法,通过组内归一化而非学习价值估计来获取优势,消除了基于评论家方法的分布依赖性偏差,提供了跨异质用户分布稳定的自校准优化信号;同时,结合了分类、对齐和保真度的多目标奖励机制也是一个关键设计。
Abstract: Human Activity Recognition using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization, a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53% and 75.22%), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.