Table of Contents

cs.CL

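[1] CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents cs.CL

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng

TL;DR: CoreMem is a resource-efficient edge-cloud memory architecture for long-term memory in dialogue agents. It unifies retrieval and compression through information geometry: Riemannian retrieval with a locally adaptive Fisher-Rao metric mitigates the hubness problem, and hierarchical sentence-to-token distillation guided by Fisher information traces provides principled compression.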

Details

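Motivation: Deploying personalized dialogue agents on consumer-grade hardware (e.g., 8 GB VRAM edge devices) runs into severe memory and compute bottlenecks because of long-term memory requirements. Existing retrieval based on isotropic cosine similarity and heuristic compression methods lack a unified theoretical foundation, and they suffer from hubness in high-dimensional retrieval and syntactic fragmentation during compression.

Result: On the LOCOMO and LongMemEval-S benchmarks, CoreMem improves accuracy on open-domain reasoning (+4.51 pp) and temporal reasoning (+4.17 pp). Profiling confirms that it runs within a strict 8 GB VRAM budget.

Insight: The novelty is applying information geometry (the Fisher-Rao metric and Fisher information) to both retrieval and compression in a memory system, giving lifelong-memory agents in resource-constrained settings a theoretical foundation. Specifically, a locally adaptive metric replaces cosine similarity to penalize hub memories, and information sensitivity drives a principled compression-KL tradeoff.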

Abstract: Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
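
The abstract does not say how the local metric is estimated. As a minimal sketch, assume the local covariance has a diagonal-plus-low-rank form Σ = D + U Uᵀ (an illustrative assumption, not CoreMem's stated parameterization). Under that assumption the Woodbury identity gives a squared Mahalanobis distance in O(dr) per memory, hence O(Ndr) over N memories, without forming a d×d inverse. The function names are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b for a small dense system (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mahalanobis_sq_woodbury(v, diag, u):
    """Return v^T (D + U U^T)^{-1} v.

    v    : difference vector (query minus memory), length d
    diag : positive diagonal entries of D, length d
    u    : the r columns of U, each a list of length d (assumed low rank, r << d)
    """
    d = len(v)
    d_inv_v = [vi / di for vi, di in zip(v, diag)]            # D^{-1} v
    base = sum(vi * zi for vi, zi in zip(v, d_inv_v))          # v^T D^{-1} v
    w = [sum(col[k] * d_inv_v[k] for k in range(d)) for col in u]  # U^T D^{-1} v
    r = len(u)
    # Capacitance matrix I_r + U^T D^{-1} U, only r x r.
    cap = [[(1.0 if i == j else 0.0)
            + sum(u[i][k] * u[j][k] / diag[k] for k in range(d))
            for j in range(r)] for i in range(r)]
    return base - sum(wi * si for wi, si in zip(w, solve(cap, w)))


# Toy usage: d = 3, rank r = 1.
dist_sq = mahalanobis_sq_woodbury([1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [[1.0, 0.0, 1.0]])
```

In a real retriever the r×r capacitance matrix would be built and factored once per local metric rather than once per distance, which is what keeps the per-memory cost at O(dr).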


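[2] LLM Parameters for Math Across Languages: Shared or Separate? cs.CL

Behzad Shomali, Luisa Victor, Tim Selbach, Ali Hamza Bashir, David Berghaus

TL;DR: This paper studies the mechanisms behind cross-lingual differences in the mathematical reasoning of multilingual large language models (LLMs). Through a cross-lingual mechanistic analysis, the authors find that model parameters associated with mathematical reasoning partially overlap across languages, with the overlap concentrated in the intermediate layers. English triggers the largest set of math-related parameters and lower-resource languages trigger smaller sets, indicating that math behavior is neither fully language-invariant nor fully language-specific but shows systematic language-dependent differences.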

Details

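Motivation: To determine the root cause of the substantial cross-lingual variation in LLM mathematical reasoning performance: whether it stems from language-specific parameters or from a shared mechanism that manifests differently across languages.

Result: The extracted math-associated parameters partially overlap across languages, with the strongest overlap in intermediate model layers. English yields the largest set of math-relevant parameters; lower-resource languages yield smaller sets.

Insight: The novelty is a cross-lingual mechanistic analysis method that localizes and compares the model parameters supporting mathematical reasoning in different languages. The core finding is that mathematical reasoning in multilingual LLMs follows a mixed mechanism, in which partial cross-lingual parameter overlap coexists with systematic language-dependent differences, rather than a simple binary. This offers a new perspective on the models' internal representations.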

Abstract: Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.


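[3] Continuous Audio Thinking for Large Audio Language Models cs.CL | cs.AI | cs.SD | eess.AS

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

TL;DR: This paper proposes Continuous Audio Thinking (CoAT), a framework that strengthens the ability of large audio language models (LALMs) to retain and use acoustic information before generating a text response. Through distillation from audio expert models, it gives the model a continuous latent workspace for organizing acoustic information, improving performance on audio understanding, reasoning, music classification, speech emotion recognition, and transcription without adding autoregressive decoding cost.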

Details

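Motivation: Existing LALMs are typically trained to produce text-aligned responses, so their hidden states drift toward text generation. The rich acoustic information that audio carries (phonetic detail, prosody, sound events, affect, and pitch) is lost along the way and is hard to use in the final response.

Result: Experiments on three LALMs (Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3) show that CoAT improves performance across a broad benchmark suite covering audio reasoning, audio understanding, music classification, speech emotion, and speech transcription.

Insight: The core novelty is a continuous latent "thinking space" for audio language models, supervised through expert distillation, which lets the model organize and use rich acoustic information before generating a response. Because the thinking block is processed in a single prefill, the design adds no autoregressive decoding overhead, balancing efficiency and performance.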

Abstract: Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model’s textual responses.


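[4] VISUALSKILL: Multimodal Skills for Computer-Use Agents cs.CL

Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu

TL;DR: This paper proposes VISUALSKILL, a hierarchical multimodal skill library for computer-use agents. Each skill is tailored to a target application and organized as a central index over per-topic files, which the agent consumes through an MCP tool that fetches the relevant topic's text and images on demand. Skills are built with a two-stage pipeline that combines authored documentation with live-application UI exploration.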

Details

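Motivation: Existing skill libraries represent skills as text only, ignoring the visual nature of GUI interaction, which leaves agents performing poorly on long-horizon tasks and unseen software. This paper addresses the problem by bringing visual information into the skill.

Result: On two benchmarks, CUA-World and OSExpert-Eval, an agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, an absolute lift of 15.3 points over the no-skill baseline (0.303). Against a text-only skill generated from the same source content and differing only in modality, VISUALSKILL yields a further 8.3-point absolute gain (0.456 vs. 0.373), showing the value of retaining visual figures.

Insight: The core novelty is a hierarchical, multimodal (text plus image) skill representation together with a two-stage skill construction pipeline. The key insight is direct evidence that keeping visual information in the skill library, rather than describing it in words only, helps the agent identify UI elements and verify workflow state after each action, which improves performance.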

Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VISUALSKILL: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a load_topic MCP tool that fetches the relevant topic’s text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of 0.456 with VISUALSKILL, a +15.3 point absolute lift over the no-skill baseline (0.303). Against a matched text-only skill that is generated from the same source content and differs from VISUALSKILL only in modality, VISUALSKILL yields a further +8.3 point absolute gain over the matched text-only skill (0.373 vs. 0.456), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at https://github.com/XMHZZ2018/VisualSkills.


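[5] Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications cs.CL

Paresh Dashore, Shreyas Kulkarni, Uttam Gurram, Nadia Bathaee, Kartik Balasubramaniam

TL;DR: This paper proposes a unified framework for customizing and deploying multi-agent systems in enterprise applications, targeting domain-customization requirements and high inference cost. It has two stages, agentic model customization and inference optimization, and combines continual pretraining, supervised fine-tuning, and preference optimization with speculative decoding and FP8 quantization to achieve rapid domain adaptation and efficient inference.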

Details

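Motivation: LLM-based multi-agent systems face production-deployment challenges in enterprise applications: difficult domain customization, high latency, and high inference cost.

Result: On enterprise workloads, the framework achieves rapid domain adaptation and a 4.48x throughput speedup while maintaining performance and improving robustness on long-tail scenarios.

Insight: The novelty is combining model customization (continual pretraining, supervised fine-tuning, preference optimization) with inference optimization (speculative decoding, calibrated FP8 quantization) in one framework, so that a compact model can power an efficient and customizable multi-agent deployment.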

Abstract: Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-specific customization requirements and high latency and inference costs in agentic workflows. We propose a unified framework for customization and efficient deployment of multi-agent systems in real-world settings. The first stage, Agentic Model Customization, combines continual pretraining, supervised fine-tuning, and preference optimization to adapt a compact model to specialized domains while retaining strong agentic capabilities. The second stage, Inference Optimization, integrates speculative decoding and FP8 quantization with targeted calibration to enable cost-efficient serving with minimal quality loss. Across enterprise workloads, our framework enables rapid domain adaptation and achieves a 4.48x speedup in throughput while maintaining performance and improving robustness on long-tail scenarios.


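[6] Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text cs.CL

Hongbo Du, Zixin Lu, Jiaming Qu

TL;DR: This paper builds a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations to evaluate how well large language models preserve diagnostic uncertainty in clinical text tasks. It finds that LLMs preserve the original uncertainty cues poorly and struggle to distinguish adjacent uncertainty levels.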

Details

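Motivation: Most current studies evaluate the fluency and coherence of LLM-generated text, while whether LLMs correctly preserve diagnostic uncertainty (expressions such as "possible pneumonia") remains underexplored. These expressions directly guide follow-up testing and treatment decisions, and altering them can change the clinical meaning entirely.

Result: Three LLMs were evaluated on the benchmark, showing that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; and (2) LLMs struggle with the nuanced distinctions between adjacent uncertainty levels.

Insight: The novelty is the first systematically constructed benchmark for evaluating the preservation of clinical diagnostic uncertainty. It reveals an LLM failure mode that standard evaluation metrics do not capture, with implications for the safe deployment of LLMs in clinical workflows.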

Abstract: Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertainty remains underexplored. In clinical practice, phrases such as "possible pneumonia" communicate the strength of available evidence and directly guide decisions about follow-up testing and treatment. Altering these uncertainty expressions can change the clinical meaning entirely. In this paper, we systematically evaluated this problem in two steps. First, we constructed a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. Second, we evaluated three LLMs on this benchmark. Our results show that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; (2) LLMs struggle with nuanced distinctions between adjacent levels. This work reveals a failure mode not captured by standard evaluation metrics and provides implications for the safe deployment of LLMs in clinical workflows.


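[7] PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding cs.CL

Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin

TL;DR: This paper proposes PragReST, a framework for improving the pragmatic reasoning of large language models (LLMs). It constructs pragmatic QA data in a self-supervised way, generates counterfactual reasoning traces, and uses supervised fine-tuning and reinforcement learning to have the model internalize them, without human-labeled data or distillation from a stronger teacher. On four pragmatic benchmarks it outperforms the backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants, while preserving performance on general-knowledge and mathematical reasoning benchmarks.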

Details

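Motivation: Although large language models perform strongly on math and logical reasoning, they still struggle with tasks that require pragmatic reasoning (understanding meanings that are implied rather than explicitly stated) and tend to choose literal interpretations.

Result: On four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST outperforms the backbone models and baselines. On the accuracy-based benchmarks, it improves Qwen3-8B and Qwen3-14B over their instruct backbones by 5.37% and 5.50% absolute accuracy, respectively.

Insight: The core novelty is a fully self-supervised framework that generates and exploits counterfactual reasoning traces specifically to improve pragmatic reasoning. The key insight is that counterfactual reasoning, which contrasts the observed utterance with plausible alternatives, is essential for reducing pragmatic errors; removing it substantially lowers performance. The method improves pragmatic understanding without harming out-of-domain general reasoning.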

Abstract: Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often choosing literal interpretations. To improve LLM pragmatic reasoning, we introduce PragReST, a self-supervised framework that constructs pragmatic QA data, generates counterfactual reasoning traces, and trains models to internalize them through supervised fine-tuning and reinforcement learning, without human-labeled training data or distillation from a stronger teacher. Across four pragmatic benchmarks (PragMega, Ludwig, MetoQA, and AltPrag), PragReST improves over backbone models, task-specific pragmatic tuning baselines, and non-counterfactual variants of the same pipeline. On accuracy-based benchmarks, PragReST improves over the instruct backbone by 5.37 and 5.50% (absolute) for Qwen3-8B and Qwen3-14B, respectively. Our error analysis and ablations underscore the importance of counterfactual reasoning: PragReST primarily reduces errors caused by failures to contrast observed utterances with plausible alternatives, and removing counterfactual reasoning substantially reduces performance. Moreover, our training preserves out-of-domain performance on general-knowledge and mathematical reasoning benchmarks.


Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen

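TL;DR: This paper presents TW-LegalBench, a benchmark for evaluating how well large language models understand and reason within the Taiwanese legal system. It contains three task types (multiple-choice questions, open-ended essay questions, and legal judgment prediction), covering 18 professional domains and hundreds of crime categories. An evaluation of 13 LLMs finds that top models approach human-level performance on the lawyer qualification examination but still fall short on the judge/prosecutor examination and on citing exact legal articles.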

Details

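Motivation: Existing legal reasoning benchmarks focus mainly on common-law systems (English sources) or the civil-law system of mainland China (Simplified Chinese sources), and there is no evaluation tool for the Taiwanese legal system. This paper fills that gap by building a comprehensive benchmark from Taiwan's publicly available official legal corpus.

Result: Thirteen LLMs were evaluated on TW-LegalBench. The best-performing models exceed the passing threshold for the lawyer qualification examination (passing rate: 11%) but fall short of that for the judge and prosecutor examination (passing rate: 1 to 2%). In legal judgment prediction, models perform reasonably on verdict type and sentence prediction but struggle to cite exact legal articles.

Insight: The novelty is the first comprehensive multi-task benchmark focused on the Taiwanese legal system, together with a decomposed LLM-as-Judge evaluation framework, based on official scoring rubrics, for open-ended questions. The work highlights the limitations of LLMs in the legal domain, especially across legal systems and in exact text generation, and provides evaluation data and direction for future legal AI research.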

Abstract: Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench that utilizes Taiwanese legal system’s rich official corpus open to the public to fill the gap in evaluating LLMs on Taiwanese law, among common-law benchmarks that focus on English sources and civil-law benchmarks focusing on sources of Simplified Chinese. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1~2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.


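[9] ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement cs.CL | cs.IR

Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

TL;DR: This paper proposes ScholarSum, a student-teacher framework built on a hierarchical reflective graph for generating fluent, faithful summaries of scientific literature. It first organizes the document into a hierarchical knowledge graph to capture global logic. A student then writes an initial draft, and a teacher iteratively reviews and refines it, with fine-grained evidence retrieval ensuring factual consistency.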

Details

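Motivation: Existing summarization methods struggle to deliver both linguistic fluency and factual faithfulness: extractive methods disrupt macro-level logical coherence, while LLM-based generative methods are fluent but have limited factual consistency.

Result: Extensive experiments show that ScholarSum significantly outperforms previous baselines in both completeness and faithfulness.

Insight: The novelty is emulating a student-teacher writing process: hierarchical knowledge graph reasoning and a reflective refinement mechanism combine global structural guidance with fine-grained evidence retrieval to improve the factual consistency of summaries.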

Abstract: Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.


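[10] Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning cs.CL | cs.AI

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

TL;DR: This paper revisits long-context reasoning from a data-centric perspective and shows that a simple yet effective data recipe, paired with a minimal outcome-based GRPO setup, substantially improves the long-context reasoning of large language models. The recipe targets three complementary task families (retrieval, multi-evidence synthesis, and reasoning) with eight datasets totaling about 14K examples. Experiments show clear gains on multiple long-context benchmarks and on agentic tasks.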

Details

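Motivation: Existing work focuses largely on reward engineering, while diverse training data remains scarce. This paper tackles long-context reasoning from a data-centric angle, emphasizing the importance of high-quality, diverse training data in its own right.

Result: Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. On agentic tasks, continuing training with the data recipe improves GAIA by +4.8 points and BrowseComp by +7.0 points.

Insight: The core novelty is a general-purpose data recipe for long-context reasoning that stresses high-quality, diverse data over complex reward engineering. Its datasets, built around three complementary task families (retrieval, synthesis, reasoning), give a reproducible and effective data foundation for improving long-context ability through reinforcement learning.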

Abstract: Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families – retrieval, multi-evidence synthesis, and reasoning – for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
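
The abstract does not detail its "minimal outcome-based GRPO setup". As a rough sketch of the group-relative advantage that GRPO is generally described as using, the code below scores each sampled answer with a binary outcome reward and normalizes within the group. The exact-match reward, the population standard deviation, and all function names are assumptions for illustration, not the authors' configuration.

```python
from statistics import mean, pstdev


def outcome_reward(answer: str, gold: str) -> float:
    """Binary outcome reward: 1.0 on a case-insensitive exact match, else 0.0 (assumed)."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one long-context question whose gold answer is "Paris".
samples = ["Paris", "Lyon", "paris ", "Marseille"]
rewards = [outcome_reward(s, "Paris") for s in samples]
advantages = group_relative_advantages(rewards)
```

When every sample in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason the difficulty mix of the training data matters for this kind of setup.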


[11] Efficient Financial Language Understanding via Distillation with Synthetic Data cs.CL

Wen-Fong Huang, Edwin Simpson

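TL;DR: This paper proposes an efficient framework for financial sentiment analysis based on knowledge distillation with synthetic data, addressing the scarcity of labelled financial data and the high cost of deploying large instruction-following models. Under low-resource conditions, it clusters a small set of hand-labelled real examples, selects seeds from the clusters, and generates synthetic data via structured few-shot prompting, transferring knowledge from a large instruction-tuned teacher to compact student models.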

Details

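Motivation: Labelled data in finance is limited by confidentiality and the cost of expert annotation, and large instruction-following models are powerful but costly to deploy. A resource-efficient domain adaptation method is therefore needed that achieves effective financial language understanding with minimal human labelling effort.

Result: Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. On a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text.

Insight: The novelty is a domain adaptation framework for low-resource financial NLP whose core idea is to guide synthetic data generation with cluster-selected seeds, improving data representativeness and making distillation effective. By combining data generation with model compression, it offers a practical route to deploying domain-specific models under resource constraints.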

Abstract: Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples are collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.


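[12] Improving Medical Communication using Rubric-Guided Counterfactual Recommendations cs.CL

Adrian Cosma, Nicoleta-Nina Basoc, Andrei Niculae, Cosmin Dumitrache, Emilian Radoi

TL;DR: This paper proposes a language-model-based counterfactual recommendation pipeline for improving medical communication. Without changing the medical content, it discovers and refines communication features (such as tone, personalization, actionability, and completeness) to raise the probability of positive patient feedback. The system searches over low-cost ordinal feature changes, recommends minimal communication edits, and uses independent auditor models to check that the gains generalize.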

Details

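Motivation: Text-based telemedicine relies mainly on lightweight patient feedback, but that feedback usually reflects perceived communication quality rather than medical accuracy. A method is therefore needed that improves communication features to raise patient satisfaction without interfering with the medical content.

Result: Under independent auditor models, the recommendations raise the predicted probability of positive feedback by +6.41% on average, and 93.31% of recommendations have a non-negative gain, indicating that the method captures the predicted gains effectively.

Insight: The novelty is combining counterfactual reasoning with interpretable communication features (such as tone and personalization) to improve patient feedback through minimal edits, while the doctor keeps control over medical reasoning and final wording. This gives a data-driven, non-intrusive way to improve the quality of medical communication.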

Abstract: Text-based telemedicine increasingly relies on lightweight patient feedback; however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor’s control over medical reasoning and final wording.


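[13] Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction cs.CL

Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

TL;DR: This paper proposes RPCL (Robust Pair Confidence Learning), a training framework that addresses the brittleness of pair confidence in multimodal emotion-cause pair extraction (MECPE). By adding a confidence-difference margin constraint and aligning predictions between clean and corrupted views, it makes pair confidence more discriminative and stable, improving model performance.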

Details

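Motivation: Pair scorers in existing MECPE methods usually rely on pair-level cross entropy, which leaves pair confidence brittle: gold pairs are hard to separate from hard negatives, or depend on incidental non-gold context. This paper aims to learn more robust pair confidence through an improved training strategy.

Result: In the full text-audio-video setting on the ECF, MECAD, and MEC4 datasets, RPCL improves the three-seed mean Pair F1 over the base model by 2.58 to 2.83 percentage points and improves mean Pair AUPRC on all three datasets. Diagnostic analysis shows larger confidence gaps between gold pairs and negatives and lower margin-violation severity.

Insight: The novelty is treating the geometry of pair confidence as an explicit training target, strengthened through a confidence-difference margin constraint and view alignment. Explicitly shaping the confidence distribution is thus an effective training strategy for MECPE, and it requires no change to the model at inference time.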

Abstract: Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.
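
The abstract names a confidence-difference margin constraint against row-wise hard negatives but gives no formula. One plausible reading, sketched below, is a hinge penalty on the gap between each gold pair and the highest-confidence non-gold candidate in the same row. The hinge form, the margin value of 0.2, and the function name are assumptions for illustration rather than the RPCL loss itself.

```python
def margin_violation(row_conf, gold_idx, margin=0.2):
    """Mean hinge penalty for one emotion row.

    row_conf : confidence for each candidate cause in the row
    gold_idx : set of indices that are gold causes for this emotion
    margin   : required gap between a gold pair and the hardest negative (assumed value)
    """
    negatives = [c for j, c in enumerate(row_conf) if j not in gold_idx]
    if not negatives or not gold_idx:
        return 0.0  # nothing to separate in this row
    hardest = max(negatives)  # row-wise hard negative
    penalties = [max(0.0, margin - (row_conf[g] - hardest)) for g in gold_idx]
    return sum(penalties) / len(penalties)


# Gold cause at index 0; the negative at index 1 sits only 0.1 below it.
close_row = margin_violation([0.9, 0.8, 0.1], {0})
# Same gold cause, but every negative is far below it.
clean_row = margin_violation([0.9, 0.3, 0.1], {0})
```

A penalty of this shape is zero once the gap reaches the margin, so it only acts on rows where a competing cause is still close to the gold pair, which matches the abstract's description of separating gold pairs from hard negatives.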


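[14] Enhancing Multilingual Reasoning via Steerable Model Merging cs.CL

Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen

TL;DR: This paper proposes ST-Merge, a steerable model merging framework that uses a gated cross-attention mechanism to adaptively weight or filter a multilingual model and a reasoning model, addressing the suboptimal performance that source-model conflicts cause in conventional model merging.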

Details

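Motivation: Conventional model merging follows a one-size-fits-all strategy that cannot adjust each source model's contribution to the characteristics of the input, leading to model conflicts and degraded performance.

Result: On four multilingual reasoning benchmarks covering 21 languages, ST-Merge consistently outperforms multiple strong baselines.

Insight: The novelty is a gated cross-attention mechanism that adaptively modulates each source model's contribution, handling the needs of different inputs more flexibly.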

Abstract: Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.


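[15] Sumi: Open Uniform Diffusion Language Model from Scratch cs.CL | cs.LG

Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi

TL;DR: This paper introduces Sumi, a fully open 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. It performs on par with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, but somewhat worse on commonsense benchmarks, which may be linked to its education-heavy training data mixture.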

Details

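Motivation: No uniform diffusion language model (UDLM) has yet been pretrained from scratch at both large parameter scale and large token budget. This holds back community research on uniform diffusion's scaling behavior, generation dynamics, controllability, and trade-offs against autoregressive and masked diffusion models.

Result: Sumi is competitive with autoregressive models of comparable token budget on knowledge, reasoning, and coding benchmarks, but underperforms on commonsense benchmarks.

Insight: The main novelty is the first open uniform diffusion language model pretrained from scratch at large parameter scale (7B) and large token budget (1.5T), giving a clean reference point for studying the scaling properties and generation behavior of uniform diffusion. The release includes the model weights, checkpoints, and full training recipe, including the data mixture specification over publicly available corpora.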

Abstract: Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi (“ink” in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.


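[16] DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models cs.CL

Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao

TL;DR: This paper presents DreamReasoner-8B, an open-source block diffusion reasoning model, and systematically studies how training and inference block sizes affect long chain-of-thought reasoning. Training with large block sizes yields very poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, the authors propose block-size curriculum learning, which moves training progressively from fine-grained to coarse-grained block sizes and delivers strong reasoning across different inference block sizes.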

Details

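Motivation: Block diffusion language models accelerate decoding through parallel block-wise denoising, but whether they can be reliably scaled to long chain-of-thought reasoning remains unresolved. This paper examines how block size affects the long-chain reasoning ability of diffusion models and looks for ways to improve it.

Result: On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B.

Insight: The core novelty is showing how strongly the training block size affects a diffusion model's reasoning ability (large blocks perform poorly) and proposing block-size curriculum learning to address the problem systematically, laying a practical foundation for efficient, reasoning-capable diffusion language models.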

Abstract: Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.
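
The abstract describes the curriculum only as a gradual transition from fine-grained to coarse-grained block sizes. A minimal sketch of one such schedule follows, assuming equal-length stages and the block sizes 4, 8, 16, and 32; both choices and the function name are illustrative assumptions, not the schedule used for DreamReasoner-8B.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for a step under an equal-length staged curriculum.

    sizes is ordered fine to coarse; each size is used for roughly
    total_steps / len(sizes) consecutive steps.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    stage = step * len(sizes) // total_steps  # integer stage index, 0 .. len(sizes) - 1
    return sizes[stage]


# Block size at a few points of a 100-step run.
schedule = [block_size_at(s, 100) for s in (0, 24, 25, 50, 99)]
```

Other schedules, such as spending more steps at small block sizes or mixing sizes within a stage, fit the same interface by changing how `stage` is computed.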


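[17] Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play cs.CL | cs.MA

Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua

TL;DR: This paper proposes Multi-Agent Fictitious Play (MAFP), a new paradigm that addresses the "stance entanglement" challenge that LLM-based multi-agent systems (MAS) face on decision-making tasks. It models stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process, iteratively updating each agent's decision to improve decision quality and robustness step by step.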

Details

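Motivation: Existing LLM-based multi-agent systems handle tasks with high execution complexity well, but fall short on decision-making tasks, common in the real world, that involve multiple stakeholders whose decisions depend on one another. The paper defines this challenge as "stance entanglement".

Result: On challenging tasks that test strategy decisions for competitive scenarios, MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness against stance entanglement.

Insight: The core novelty is bringing the game-theoretic principle of fictitious play into multi-agent systems to resolve stance entanglement in decision-making tasks. This gives LLM-based multi-agent systems a new, equilibrium-seeking framework for complex, mutually dependent decisions.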

Abstract: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent’s decision by best responding to the empirical mixture of other agents’ past decisions. This enables agents to expose and address one another’s weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.
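
The update rule the abstract describes, best responding to the empirical mixture of the other agents' past decisions, is classical fictitious play. The toy sketch below runs it on a two-player zero-sum matrix game (matching pennies). It illustrates only the game-theoretic principle: in MAFP the agents are LLMs and the decisions are strategies, and the payoff matrix, tie-breaking rule, and function name here are assumptions for illustration.

```python
def fictitious_play(payoff, rounds):
    """Two-player zero-sum fictitious play.

    payoff[i][j] is the row player's payoff when row plays i and column plays j;
    the column player receives the negative. Returns both players' empirical
    action frequencies after the given number of rounds.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    for _ in range(rounds):
        # Value of each action against the opponent's empirical action counts so far.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        col_values = [-sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        # Simultaneous best responses; ties go to the lowest action index.
        row_action = row_values.index(max(row_values))
        col_action = col_values.index(max(col_values))
        row_counts[row_action] += 1
        col_counts[col_action] += 1
    return ([c / rounds for c in row_counts], [c / rounds for c in col_counts])


# Matching pennies: the row player wants to match, the column player to mismatch.
MATCHING_PENNIES = [[1, -1], [-1, 1]]
row_freq, col_freq = fictitious_play(MATCHING_PENNIES, 5000)
```

In this game the empirical frequencies drift toward the mixed equilibrium in which each action is played half the time, even though every individual best response is a pure action. That is the sense in which repeated best responses let each player expose and correct the other's exploitable habits.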


[18] Learning User Simulators with Turing Rewards cs.CLPDF

Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li

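TL;DR: This paper proposes Turing-RL, a Turing-Test-based reinforcement learning method for training user simulator models. An LLM judge scores how indistinguishable a generated response is from the real user's response, and that score is used as the reward to train the user simulator LLM to produce hard-to-distinguish responses consistent with the user's history. Across two domains, conversational chat and Reddit forum discussion, Turing-RL outperforms baselines on both LLM and human evaluation metrics.

Details

Motivation: Existing methods typically train an LLM to match a single ground-truth response by maximizing log probability or using a similarity reward, which is limiting. The paper aims to learn user simulators that mimic real human behavior in interactive settings more effectively by optimizing for indistinguishability rather than simple response matching.

Result: In experiments on two different domains, conversational chat and Reddit forum discussion, Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics.

Insight: The core innovation is a reinforcement learning framework (Turing-RL) built on a discriminative Turing-test reward, which shifts the training objective from matching a single ground-truth response to optimizing the overall indistinguishability of generated responses from real user behavior. This offers a new paradigm for building more realistic user simulators, and the idea may transfer to other interactive AI training and evaluation settings that require simulating complex human behavior.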

Abstract: Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user’s given the user’s history, and the user simulator LLM learns from these rewards to produce responses indistinguishable from what the user could have said. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.


[19] GraphPO: Graph-based Policy Optimization for Reasoning Models cs.CLPDF

Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai

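TL;DR: This paper proposes GraphPO, a graph-based policy optimization framework for enhancing large reasoning models. It represents the reasoning process as a directed acyclic graph, merges semantically equivalent reasoning paths to avoid redundant exploration, and designs an advantage estimation scheme that combines efficiency and correctness.

Details

Motivation: The work targets two main limitations of the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm: independent sampling produces redundant intermediate reasoning steps, and sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly alleviate this, but branches are still expanded independently and cannot share information.

Result: Theoretical analysis shows that GraphPO reduces advantage-estimation variance and improves reasoning efficiency. Experiments with three LLMs on multiple reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines under the same token or response budgets.

Insight: The core innovation is modeling the reasoning process as a graph, sharing computation by merging semantically equivalent paths, and designing advantage signals that separate efficiency (incoming edges) from correctness (outgoing edges), which uses supervision more efficiently and steers exploration toward diversity.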

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines under the same token or response budgets.


cs.CV [Back]

[20] Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs cs.CVPDF

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian

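TL;DR: This paper proposes CF-GRPO, a process-level reward framework that requires no temporal annotations, to strengthen the attention of video multimodal large language models (Video-MLLMs) to visual evidence during reasoning. The method builds a consensus frame prior from intrinsic video cues (temporal coverage, scene transitions, and query-conditioned visual relevance) and optimizes the agreement between a model-side frame-use score and that prior, yielding a high-contrast reward signal. Experiments show competitive performance on complex video reasoning benchmarks and improvements on several metrics.

Details

Motivation: Existing reinforcement learning guides Video-MLLMs with outcome-only rewards, which cannot effectively steer the model toward the specific visual evidence that supports an answer. Inspired by multisensory integration, where consistent cues enhance perceptual salience and reliability, the paper addresses how to guide process-level visual evidence alignment in video reasoning.

Result: Experiments on multiple complex video reasoning benchmarks show that VideoCFR achieves competitive performance and outperforms representative Video-MLLM and reinforcement learning baselines on several metrics.

Insight: The innovation is a process-level reward framework that needs no human temporal annotations: it aligns and optimizes the model's visual focus through a consensus frame prior built from intrinsic video cues, which gives evidence-aware video reasoning an interpretable training view and a high-contrast reward signal.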

Abstract: Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.


[21] Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning cs.CVPDF

Sneha Paul, Zachary Patterson, Nizar Bouguila

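TL;DR: This paper proposes ReFine3D, a regularized framework for domain-generalizable fine-tuning of 3D large multimodal models (LMMs). It combines selective layer tuning, multi-view consistency regularization, text diversity regularization based on LLM-generated synonym prompts, point-rendered vision supervision, and a confidence-based test-time augmentation mechanism to address the overfitting and catastrophic forgetting that 3D vision-language models face when fine-tuned on downstream domains with limited data.

Details

Motivation: The core challenge is that 3D multimodal foundation models, which align point clouds with visual and textual data, are prone to overfitting and catastrophic forgetting when adapted to downstream domains with limited data.

Result: Extensive experiments on multiple 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.

Insight: The innovation is combining selective layer tuning with two targeted regularization strategies (multi-view consistency and LLM-based text diversity), supplemented by point-rendered vision supervision and test-time augmentation, giving a systematic, compute-efficient solution for domain-generalizable fine-tuning of 3D LMMs.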

Abstract: Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.


[22] Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation cs.CVPDF

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling

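TL;DR: This paper proposes Data-Forcing Distillation (DFD), a post-training framework that addresses the loss of sample diversity and the over-saturated outputs caused by the reverse KL objective in video diffusion distillation methods such as DMD and DMD2. The method uses the teacher score discrepancy to guide the student toward the real-data distribution, restoring the diversity and fidelity of generated videos with only a small number of fine-tuning steps.

Details

Motivation: Existing video diffusion distillation methods (such as DMD/DMD2) show two persistent problems under the reverse KL objective: a substantial drop in sample diversity, and over-saturation artifacts that deviate from real-video appearance.

Result: Validated on text-to-video, image-to-video, and autoregressive video generation, DFD needs only 100–300 fine-tuning steps to restore diversity and fidelity on the Wan2.1-1.3B and Cosmos-Predict2.5-2B models. It resolves over-saturation artifacts, substantially improves video dynamics and appearance, and even outperforms the teacher model.

Insight: The core innovation is using the teacher score discrepancy as a guidance signal that pulls the student toward modes of the real-data distribution it is missing (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). It is a simple, efficient post-training correction strategy.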

Abstract: Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback–Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy, which guides the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100–300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforms the teacher model.
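
The diversity loss attributed to the reverse KL objective can be seen in a toy discrete example. The sketch below is only an illustration of the general mode-seeking behavior of reverse KL, not of DMD or DFD themselves; the distributions `p`, `q_single_mode`, and `q_broad` are made-up values, and `kl` is a hypothetical helper.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for discrete distributions with strictly positive entries."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a * np.log(a / b)))

# Toy "real data" distribution with two modes (first and last bins).
p = [0.49, 0.01, 0.01, 0.49]
# Candidate student A: collapses onto a single mode.
q_single_mode = [0.97, 0.01, 0.01, 0.01]
# Candidate student B: spreads mass so that both modes are covered.
q_broad = [0.25, 0.25, 0.25, 0.25]

# Reverse KL, KL(q || p): the direction named in the abstract.
print("reverse KL, single mode:", kl(q_single_mode, p))
print("reverse KL, broad:      ", kl(q_broad, p))
# Forward KL, KL(p || q): shown for contrast.
print("forward KL, single mode:", kl(p, q_single_mode))
print("forward KL, broad:      ", kl(p, q_broad))
```

Under reverse KL the single-mode candidate scores better than the broad one, while forward KL ranks them the other way, which is one common way to explain why reverse-KL distillation tends to drop modes.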


[23] Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks cs.CV | cs.CRPDF

Ngela Landon Ntung, Floride Tuyisenge, Jema David Ndibwile

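TL;DR: This paper compares Vision Transformers (ViTs) and convolutional neural networks (CNNs) on face presentation attack detection (PAD) in terms of performance and fairness. On the cross-ethnicity CASIA-SURF CeFA dataset, the pretrained DeiT-S model achieves the highest overall accuracy (97.27%) and the lowest equal error rate (0.86%), markedly narrows the performance gap between ethnic groups, and generalizes better to the zero-shot Central Asian group.

Details

Motivation: Existing face PAD systems show systematic performance disparities across demographic groups, especially for people with darker skin tones, which raises fairness concerns. The paper investigates whether Vision Transformer architectures reduce this demographic bias more than conventional CNN baselines.

Result: On CASIA-SURF CeFA, the pretrained DeiT-S model achieves 97.27% overall accuracy and 0.86% EER, outperforming ResNet18 (90.15% accuracy). For fairness, DeiT-S reduces the ACER gap between African and East Asian groups to 0.13%, an 83% reduction relative to an LBP-based work. On the zero-shot Central Asian group, DeiT-S records a BPCER of 2.89%, far below ResNet18's 10.44%, a 3.6x generalization advantage.

Insight: The paper's claimed innovation is the first systematic comparison of architectural bias between ViTs and CNNs on face PAD, with the finding that pretrained Vision Transformers outperform CNNs in accuracy, cross-group fairness, and generalization to unseen groups. Viewed objectively, this provides new empirical evidence and ideas for mitigating fairness problems in biometric systems through architectural design (for example, adopting pretrained Transformers).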

Abstract: Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.
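
The two derived figures in the abstract (the 83% gap reduction and the 3.6x advantage) follow directly from the reported percentages. A short check, using only numbers quoted above and variable names chosen here for illustration:

```python
# ACER gap between African and East Asian subjects, in percentage points.
lbp_gap = 0.75    # reported for the LBP-based work [6]
deit_gap = 0.13   # reported for DeiT-S

# Relative reduction of the gap.
gap_reduction = (lbp_gap - deit_gap) / lbp_gap
print(f"ACER gap reduction: {gap_reduction:.1%}")

# BPCER on the zero-shot Central Asian group, in percent.
resnet_bpcer = 10.44
deit_bpcer = 2.89

# Ratio of error rates, quoted as the generalization advantage.
bpcer_ratio = resnet_bpcer / deit_bpcer
print(f"BPCER ratio: {bpcer_ratio:.2f}x")
```

The reduction is relative to the LBP-based gap, and the ratio compares error rates, so a larger ratio means fewer bona fide presentations misclassified by DeiT-S on the unseen group.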


[24] Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition cs.CVPDF

Trong-Vu Hoang, Quang-Binh Nguyen, Dinh-Khoi Vo, Hoai-Danh Vo, Minh-Triet Tran

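TL;DR: This paper proposes using synthetic images generated by Stable Diffusion (SD) as a data augmentation tool to address the shortage of training data in indoor scene recognition. Synthesizing diverse, realistic indoor scenes enriches the training data and improves model performance when real data is limited. In addition, to prevent misuse of SD synthetic images, the authors introduce a countermeasure based on DIffusion Reconstruction Error (DIRE) that lets lightweight models such as MobilenetV3 perfectly identify SD-generated images.

Details

Motivation: Indoor image recognition is challenged by complex lighting conditions, occlusions, and object arrangements, and training data is scarce. The authors therefore use SD-generated synthetic images as data augmentation to make models more robust.

Result: Experiments on the MIT Indoor Scene dataset show that the method effectively improves the training of deep models. The DIRE-based countermeasure lets MobilenetV3 reach 100% accuracy in recognizing SD-generated images.

Insight: The innovation is recasting text-to-image generation (SD) as a semantic-aware data augmentation tool for indoor scene recognition and introducing DIRE as an effective defense against misuse of synthetic images, which offers a new approach to model training in data-scarce settings.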

Abstract: In the realm of computer vision, indoor image recognition presents challenges due to the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the lack of indoor training images, we introduce a novel approach leveraging Stable Diffusion (SD) for the generation of synthetic images, which serve as a powerful data augmentation tool. The utilization of SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a countermeasure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE representation enables training robust classifiers using only lightweight deep models. Experiments show that our approach can perfectly recognize SD-generated images, with 100% accuracy using MobilenetV3.


[25] Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning cs.CVPDF

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

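TL;DR: This paper proposes a novel retrieval-augmented image captioning framework that generates comprehensive captions for news images with deeper insights such as object attributes, event context, and underlying significance. At its core is a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities by considering article structure-aware features and multi-faceted similarity computations. The retrieved articles serve as the knowledge base: a vision-language model generates a preliminary description, relevant information is segmented from the articles based on it, and a large language model integrates both into the final detailed caption.

Details

Motivation: Traditional image captioning methods struggle to generate comprehensive, context-rich descriptions, especially for details that cannot be observed directly from visual cues. External knowledge is therefore needed to enrich caption generation with deeper insights.

Result: In the ACM Multimedia EVENTA 2025 Challenge, the method placed 5th on the private test set of the OpenEvent-V1 dataset with an overall score of 0.2824.

Insight: The innovation is the hierarchical multi-modal article retrieval mechanism, which combines article structure-aware features (such as weighted textual components and visual placement patterns) with multi-faceted similarity computations to retrieve more relevant external knowledge. A VLM-then-LLM workflow then fuses the visual description with the retrieved knowledge to produce a detailed, context-rich caption.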

Abstract: Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content–visual, visual–visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.


[26] MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction cs.CVPDF

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki

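TL;DR: This paper presents MolmoMotion, a general framework for language-instruction-conditioned 3D point trajectory forecasting. The work comprises a large-scale dataset (MolmoMotion-1M), an evaluation benchmark (PointMotionBench), and a forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. The model accurately predicts diverse motion patterns under different language instructions and transfers well to downstream robot manipulation and video generation tasks.

Details

Motivation: Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. The authors argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks.

Result: MolmoMotion significantly outperforms existing motion prediction baselines on PointMotionBench, which spans 111 object categories and 61 motion types.

Insight: The innovation is combining language instructions, a 3D point representation, and motion forecasting into a complete, scalable research stack (dataset, benchmark, and model). The learned 3D motion prior transfers effectively to downstream applications: robot manipulation (better training efficiency and generalization) and generative models (motion guidance for synthesizing videos with more realistic object motion).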

Abstract: Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.


[27] Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting cs.CV | cs.AI | cs.GRPDF

Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang

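TL;DR: This paper targets low-light crowd counting and proposes a unified Low-Light Counting Network (LCNet). The method introduces depth and Canny edge cues as complementary geometric and structural priors, designs a Multi-Modal Hyper-Graph Fusion module to model the high-order complementary relationships among RGB appearance, depth geometry, and edge structure, and uses a Deformable Rectangular Sparse Attention module to allocate computation adaptively.

Details

Motivation: Existing methods rely on a single RGB modality, which becomes unreliable under low light and complex non-uniform illumination. The paper fills this gap in low-light crowd counting research.

Result: Extensive experiments on three newly constructed low-light crowd counting benchmarks (two synthetic datasets, SHA_Dark and SHB_Dark, and one real-world dataset, LC-Crowd) show that the proposed method achieves the best overall performance against existing state-of-the-art methods.

Insight: The innovations are: introducing depth and edge cues as complementary priors to enhance the reflectance representation; a Multi-Modal Hyper-Graph Fusion module that explicitly models high-order multi-modal relationships; and a Deformable Rectangular Sparse Attention module for adaptive computation allocation.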

Abstract: Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.


[28] APT: Atomic Physical Transitions for Causal Video-Language Understanding cs.CV | cs.AIPDF

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu

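TL;DR: This paper introduces Atomic Physical Transitions (APTs) for causal video-language understanding, decomposing physical events into sequences of minimal, temporally localized state changes. The authors build a mixed-source APT dataset and find that current vision-language models (VLMs) fall short on transition-level physical understanding. To counter the forgetting caused by fine-tuning, the paper proposes APT-Tune, a parameter-efficient method that improves APT detection while also improving event-level video understanding.

Details

Motivation: Current video understanding models rely only on aggregate event labels (such as "bounce") and ignore the underlying causal physical process. By introducing APTs, the paper explicitly models the minimal causal state transitions that make up a physical event, explaining why an event happened rather than only what happened.

Result: On the constructed APT dataset, current VLMs reach a zero-shot recall of at most 14%, with errors dominated by missed transitions. APT-Tune, fine-tuning Qwen3-VL-2B with only 11M LoRA parameters, substantially improves APT recall while also improving performance on event-level video transfer tasks.

Insight: The innovations are APTs as a causal supervision signal and APT-Tune as a parameter-efficient fine-tuning recipe. Through image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding, the model learns a reusable physical representation rather than merely adapting to a specific answer format, which improves its physical understanding.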

Abstract: Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as “bounce” can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.


[29] Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops cs.CVPDF

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang

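TL;DR: This paper proposes CHIEF, a human-AI co-creation video generation framework that addresses the lack of narrative coherence and creative direction in AI-generated videos by placing the creator inside an iterative video refinement loop and using persona-conditioned multimodal LLMs to provide subjective feedback. The framework supports creations ranging from 1-minute shorts to a 10-minute short film with a complicated plot, and was validated with students who had no filmmaking experience.

Details

Motivation: Current AI-generated videos often lack narrative coherence and creative direction, and the problem worsens at longer durations. Unlike coding tasks, video generation requires subjective feedback about plot, scenes, and narrative, which motivates incorporating human creative direction into the generation process.

Result: Working with high school and college students with no filmmaking experience, CHIEF supported creations ranging from 1-minute shorts to a 10-minute short film with a complicated plot, supporting its effectiveness in improving narrative quality and creative expression.

Insight: The innovations are: placing the creator at the center of human-in-the-loop iterative refinement, using persona-conditioned multimodal LLMs to generate subjective critique from the audience's perspective, and integrating the creator's revisions through a specialized refiner agent, which covers what self-evaluation alone cannot.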

Abstract: Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement, and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from the audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete short 10-minute film with a complicated plot.


[30] Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification cs.CVPDF

Nan Zhou, Ke Zou, Meng Liu, Linchao He, Jiaqi Zhu

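TL;DR: This paper proposes CoEV (Counter-Evidence Verification), a training-free plug-and-play framework for detecting and correcting hallucinations in medical vision-language models (VLMs). It bidirectionally verifies the factual consistency between textual assertions and visual evidence and places each statement in a four-quadrant diagnostic map, identifying hallucinated content and correcting it as a post-processing tool.

Details

Motivation: Hallucinations undermine the reliability of VLMs in medical diagnosis. Existing methods mainly focus on factual inconsistencies between generated text and reference data and seldom verify whether the model's visual attention truly reflects the evidence supporting the generated text. The paper aims to fill this gap.

Result: Experiments on four medical datasets show that CoEV consistently outperforms existing methods in hallucination detection, improving average PR-AUC and ROC-AUC by 3.0% and 3.9% in absolute terms respectively, with gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and improves medical VQA accuracy.

Insight: The innovation is a training-free bidirectional evidence verification framework that systematically detects and corrects hallucinations through a four-quadrant diagnostic map combining text factuality and visual grounding. It works as a plug-and-play post-processing tool with no model retraining, providing reliable, evidence-based cues for clinical diagnosis.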

Abstract: The reliability of vision-language models (VLMs) in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Counter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement to a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs. For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0% and 3.9% absolute points respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.


[31] Intrinsic 4D Gaussian Segmentation from Scene Cues cs.CV | eess.IVPDF

Hasan Yazar, Mohamed Rayan Barhdadi, Erchin Serpedin, Mehmet Tuncel, Hasan Kurban

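TL;DR: This paper proposes Intrinsic-GS, a training-free method for segmenting 4D Gaussian scenes without mask supervision. It builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory, and rendered-boundary cues, then partitions it with Leiden community detection, with no reliance on external foundation models (such as SAM) or learned feature fields.

Details

Motivation: Existing dynamic 4D Gaussian scene segmentation methods import 2D masks from foundation models such as SAM and lift or distill them into the Gaussian representation. Across many frames and views this is costly, and segmentation quality depends on the consistency of the external masks. The paper explores how much object-level structure can be recovered from the Gaussian primitives alone.

Result: On the standard 4D Gaussian segmentation benchmarks Neu3D and HyperNeRF, Intrinsic-GS reaches 0.746 and 0.575 mIoU respectively without mask supervision. On Neu3D, a geometry-only variant reaches 0.902 mIoU, on par with the SAM-supervised TRASE method. On HyperNeRF, it runs 12.5x faster than the mask-generation and feature-rendering stages of mask-supervised pipelines.

Insight: The innovation is segmenting entirely from the intrinsic geometric and dynamic attributes of Gaussian primitives (such as deformation trajectories), avoiding any dependence on external masks. This opens a fast, mask-free direction for 3D/4D Gaussian segmentation and may enable more general, robust segmentation where external masks are unreliable or expensive.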

Abstract: Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.


[32] Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration cs.CVPDF

Chen Zhao, Xiantao Hu, Song Wu, Qian Wang, Chen Wu

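TL;DR: This paper proposes SPWM, a spiking pyramid wavelet-based model for efficient, low-energy image restoration. It introduces a spiking dual pyramid wavelet (SDPW) block to model long-range dependencies and exploit the properties of degradation in the wavelet domain. Experiments show that SPWM significantly lowers computational cost and energy consumption while maintaining image quality.

Details

Motivation: Existing spiking CNN-based image restoration methods are constrained by the inherent receptive field limitations of CNN operations, which caps their performance. The paper explores the benefits of the discrete wavelet transform to address long-range dependency modeling and achieve efficient, low-energy image restoration.

Result: Experiments on several benchmarks show that SPWM significantly lowers computational cost and energy consumption while maintaining image quality, demonstrating the potential of SNNs for image restoration.

Insight: The main innovation is combining the discrete wavelet transform with spiking neural networks through the SDPW block, which models long-range dependencies and exploits wavelet-domain properties. This offers new ideas for image restoration on resource-limited devices and highlights the importance of structural design (such as pyramid wavelets) for improving SNN performance.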

Abstract: Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In this paper, we explore the benefits of the discrete wavelet transform and propose a spiking pyramid wavelet-based model (SPWM) for highly efficient, low-energy image restoration. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependencies and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications on resource-limited devices.


[33] LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis cs.CV | cs.AIPDF

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui

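TL;DR: This paper proposes an instruction-driven agentic framework for intelligent landslide hazard identification with three core components: LandslideBench, a landslide dataset with multimodal fine-grained annotations; LandslideVLM, a landslide-specific vision-language model fine-tuned on that dataset; and LandslideAgent, an agent that uses LandslideVLM as its cognitive core and is constrained by domain rules. The goal is to automate the full pipeline of landslide identification and analysis.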

Details

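Motivation: Current approaches to intelligent landslide hazard interpretation struggle to extract visual features and high-level geoscientific semantics at the same time, while general-purpose vision-language models suffer from perceptual limitations and domain hallucinations in complex geological scenes. This calls for an autonomous identification and analysis system that fuses visual and domain knowledge.

Result: On the LandslideBench dataset, five mainstream models provide effective baselines for fine-grained classification and semantic segmentation. LandslideVLM improves accuracy by 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent enables autonomous inference over multi-source spatial data.

Insight: The contributions are a high-quality multimodal landslide benchmark dataset and a domain-rule-augmented agent architecture that builds structured report metadata constraints and cross-validation identification constraints into the tool-invocation process. This systematically integrates domain knowledge into autonomous decision-making and improves reliability and automation in specialized settings.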

Abstract: Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.


[34] Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs cs.CVPDF

Jaeyeon Lee, Shunjie Wen, Dong-Wan Choi

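TL;DR: This paper proposes SPARE, which recasts visual token pruning in vision-language models as subspace reconstruction, treats it as a column subset selection problem, and adds an anti-relevance criterion for context-aware token selection. It achieves state-of-the-art performance across multiple benchmarks.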

Details

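Motivation: Existing diversity-maximization token pruning methods based on cosine similarity discard magnitude information and cannot faithfully approximate the original feature representation, leading to suboptimal performance on compositional multi-skill reasoning tasks.

Result: Across multiple VLMs and benchmarks, SPARE consistently achieves state-of-the-art performance, with especially strong gains on compositional tasks. Applied to LLaVA, it prunes up to 94% of visual tokens while retaining 95% of baseline performance, with no training required.

Insight: The contributions are reformulating token pruning as a subspace reconstruction problem that minimizes reconstruction error, and identifying an anti-relevance phenomenon (tokens with lower image-text relevance better preserve contextual information), which is used as an additional selection criterion.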

Abstract: Despite their remarkable performance, Vision Language Models (VLMs) incur substantial computational overhead due to the large number of visual tokens. While diversity maximization has become a dominant strategy for token reduction, existing methods rely on cosine-based normalized similarity that discards magnitude information, failing to faithfully approximate the original feature representation and leading to suboptimal performance, particularly on compositional multi-skill reasoning tasks. In this paper, we introduce SPARE, a subspace reconstruction method that reformulates token pruning as a column subset selection problem and explicitly minimizes reconstruction error. By iteratively selecting tokens with large projection residuals, SPARE performs reconstruction-driven pruning beyond angular diversity. Moreover, we reveal a counterintuitive anti-relevance phenomenon: tokens with lower image-text relevance scores can better preserve contextual information. Based on this finding, we incorporate anti-relevance into SPARE as an additional selection criterion to promote context-aware token selection. Extensive experiments across multiple VLMs and benchmarks demonstrate that SPARE consistently achieves state-of-the-art performance, with strong gains on compositional tasks. When applied to LLaVA, SPARE removes up to 94% of visual tokens while retaining 95% of the baseline performance, all in a fully training-free manner.
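
The idea of "iteratively selecting tokens with large projection residuals" can be sketched as a greedy column subset selection. The code below is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function name `select_tokens_by_residual` is hypothetical, tokens are plain lists of floats, and the anti-relevance criterion is left out.

```python
def select_tokens_by_residual(tokens, k):
    """Greedily pick up to k token indices with the largest projection residual.

    At each step the token whose component outside the span of the already
    selected tokens has the largest norm is chosen, and that direction is
    then removed from every residual (Gram-Schmidt style deflation).
    """
    residuals = [list(t) for t in tokens]
    selected = []
    for _ in range(min(k, len(tokens))):
        norms = [sum(x * x for x in r) for r in residuals]
        best = max(range(len(tokens)), key=lambda i: norms[i])
        if norms[best] <= 1e-12:
            break  # remaining tokens already lie in the selected subspace
        selected.append(best)
        scale = norms[best] ** 0.5
        q = [x / scale for x in residuals[best]]  # unit vector of the new direction
        for r in residuals:
            dot = sum(a * b for a, b in zip(r, q))
            for j in range(len(r)):
                r[j] -= dot * q[j]
    return selected
```

Unlike cosine-based diversity, this criterion is sensitive to magnitude: among tokens pointing in the same direction, the one with the largest norm is picked first, and the others then contribute no residual.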


[35] BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection cs.CVPDF

Md Taimur Ahad, Bo Song, Yan Li

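TL;DR: This paper proposes BrainFusionNet, a deep learning model for improved brain tumour detection. It fuses a CNN, a ViT, and a GRU to extract local, global, and sequential features from MRI images, and integrates explainable AI (XAI) techniques to visualize its decision process. Evaluated on two public MRI datasets, the model reaches 98% accuracy, outperforming several state-of-the-art CNN models.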

Details

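Motivation: Noise in MRI images, obscured tumour boundaries, and complex tumour location and appearance limit the effectiveness of deep learning for brain tumour detection, motivating a hybrid model that can effectively extract multi-scale features.

Result: With K-fold validation on public MRI datasets, the model reaches 98% accuracy, outperforming state-of-the-art CNN models such as DenseNet121 and VGG16 (96% at best), which indicates state-of-the-art performance.

Insight: The contribution is a balanced hybrid architecture: a CNN extracts low-level and deeper-layer features, a customized ViT captures local features and stabilizes gradient flow, and a GRU performs the final classification. The study also finds that the distribution of pixel intensities in MRI images affects deep learning performance, which the authors present as a novel finding for image interpretation.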

Abstract: Noise in Magnetic Resonance Imaging (MRI) poses challenges for Deep Learning (DL) when tumor boundaries are obscured and tumor location and appearance are complex. Therefore, we develop BrainFusionNet, which combines Convolutional Neural Networks (CNNs), Vision Transformers (ViT), and Gated Recurrent Units (GRUs) to extract spatial, contextual, and sequential features from MRI images for improved brain tumor classification. Furthermore, explainable AI methods such as SHAP, LIME, and Grad-CAM are integrated to visualise and highlight image regions that contribute to BrainFusionNet's decision-making process. The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets. K-fold validation suggests 98% accuracy on both datasets. The model was compared with six state-of-the-art (SOTA) CNNs and transfer learning. Among the SOTA CNNs, DenseNet121 and VGG16 achieved the highest accuracy of 96%. The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images, even in small-scale tumor regions and small tumor sizes. The model has a balanced sequential CNN architecture to capture low-level and deeper-layer features, and a customized ViT that captures local features, stabilizes gradient flow, and reduces the risk of vanishing gradients during MRI image training. The CNN and ViT outputs are fed into a GRU for final classification. Furthermore, we analyze pixel intensities to determine whether MRI image quality affects image classification. Our findings are novel in image interpretation, as we found that the distribution of pixel intensities in MRI images affects DL performance.


[36] UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation cs.CVPDF

Lin Zhang, Sicheng Mo, Zefan Cai, Jinhong Lin, Zihao Lin

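TL;DR: UniTemp is a bidirectional distillation framework for training autoregressive video diffusion models that generate in any temporal direction, addressing the restriction of existing methods to forward generation. It introduces blockwise anchor latents to repair inter-block discontinuities during backward generation, enabling conditioning on arbitrary past and/or future frames.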

Details

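Motivation: Existing autoregressive video diffusion models are restricted to forward temporal generation, whereas practical video creation (such as backward extension and inbetween generation) requires a flexible generation order. The paper aims to close this gap.

Result: Experiments show that UniTemp maintains performance competitive with forward-only methods on short and long video generation while supporting workflows such as bidirectional extension, inbetween generation, looping video generation, scene transition, and visual story generation.

Insight: The core contribution is blockwise anchor latents, which address the missing context that the causal 3D VAE suffers during backward generation, combined with bidirectional distillation that trains a single model for any-direction generation, improving the flexibility and controllability of video generation.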

Abstract: Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/


[37] Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models cs.CVPDF

Tai Le-Gia

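TL;DR: This paper proposes CS3F, a training-free, batch-based zero-shot anomaly detection framework for 3D medical images. It decomposes each 3D volume into 2D slices along anatomical axes, encodes them with a pretrained 2D vision transformer, and pools neighboring slice features into localized volumetric tokens. Anomaly scores are computed from cross-subject token similarity: tokens that lack close counterparts in other subjects are flagged as anomalous.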

Details

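Motivation: Zero-shot anomaly detection in 3D medical images is challenging because most existing methods are designed for 2D, and extending them directly to 3D is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of exploiting volumetric context.

Result: Evaluated on brain MRI (metastases, glioma, and stroke) and lung CT, the results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.

Insight: The contribution is a training-free, batch-based method that applies 2D foundation models to 3D data, scores anomalies via cross-subject token similarity, and introduces a coarse-to-fine tokenization strategy to reduce the attenuation of focal lesion signals caused by depth pooling, enabling effective zero-shot anomaly detection in 3D medical images.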

Abstract: Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing zero-shot anomaly detection methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. These are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, as well as validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.
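
To make the batch-based scoring idea concrete, here is a minimal sketch in which a token's anomaly score is one minus its average best cosine match in each of the other subjects. This is a simplified nearest-neighbour stand-in for the "cross-subject mutual similarity" named in the abstract; the exact scoring rule, the function names, and the toy token format (lists of floats) are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def anomaly_scores(batch):
    """Score every token of every subject in a batch.

    batch: list of subjects, each a list of token vectors.
    Returns a list of per-subject score lists; higher means more anomalous.
    """
    scores = []
    for s, tokens in enumerate(batch):
        subject_scores = []
        for tok in tokens:
            # Best match of this token inside each *other* subject.
            best_per_subject = [
                max(cosine(tok, other) for other in other_tokens)
                for o, other_tokens in enumerate(batch)
                if o != s and other_tokens
            ]
            if best_per_subject:
                mean_best = sum(best_per_subject) / len(best_per_subject)
            else:
                mean_best = 0.0
            subject_scores.append(1.0 - mean_best)
        scores.append(subject_scores)
    return scores
```

No training is involved: the only inputs are frozen features for a batch of subjects, so a token that looks like nothing in the rest of the batch receives a high score.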


[38] SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction cs.CV | cs.CL | cs.MMPDF

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian

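TL;DR: This paper proposes SAMA (Semantic Anchor-aligned Augmentation), a framework that addresses data scarcity in low-resource multimodal information extraction (MIE). It builds structured semantic anchors that guide a collaborative multi-expert multimodal large language model to generate high-quality, task-aware synthetic text, uses an anchor-preserving diffusion mechanism to synthesize diverse images, and automatically filters samples with a dual-constraint filtering module.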

Details

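Motivation: MIE tasks (such as multimodal named entity recognition, relation extraction, and event extraction) are constrained by severe data scarcity, and existing data augmentation methods suffer from coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge.

Result: Extensive experiments on benchmark datasets for MNER, MRE, and MEE show that SAMA consistently outperforms state-of-the-art data augmentation baselines in both fully supervised and low-resource settings.

Insight: The contribution is a unified augmentation framework based on semantic anchor alignment: structured semantic anchors guide the coordinated generation of text and images, and dual-constraint filtering provides automatic verification. This exploits semantic knowledge shared across tasks while ensuring the fidelity and diversity of the synthetic data.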

Abstract: Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.


[39] HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space cs.CV | cs.CLPDF

Jaward Sesay, Yue Yu, Börje F. Karlsson

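TL;DR: This paper proposes HandwritingAgent, a language-driven agent that synthesizes natural handwriting sequences in Scalable Vector Graphics (SVG) space. It uses a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs in a discrete grid canvas environment, requires no style-specific training, and controls style flexibly through natural language and a reference image.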

Details

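Motivation: Current handwriting synthesis methods rely on style-specific architectures and large datasets, incur high compute costs, and lack flexible control over writing style through natural language. This paper aims to address these limitations.

Result: Experiments on diverse tasks spanning imitation, recognition, multilingual handwriting synthesis, and generation of complex maths and science expressions show substantial performance improvements, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models.

Insight: The contribution is framing handwriting synthesis as a language-driven agent task that generates directly in SVG vector space, yielding an efficient, controllable, and generalizable synthesis method without style-specific training, with flexible style control through natural-language instructions.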

Abstract: Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person’s handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.


[40] Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos cs.CV | cs.LGPDF

Yuxuan Xie, Nicolas Pugeault, Chongfeng Wei, Hubert P. H. Shum, Edmond S. L. Ho

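TL;DR: This paper proposes MMPM, a mode-aware framework for predicting pedestrian trajectories from ego-centric video. By modelling pedestrian crossing behavior, it separates the future trajectory distribution into semantically meaningful modes (crossing vs. not crossing), addressing the sub-optimal mixed-mode trajectories that existing stochastic predictors produce by sampling from a single unimodal distribution.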

Details

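Motivation: Predicting pedestrian trajectories from an ego-centric view is challenging because it depends on complex interactions between pedestrians, vehicles, and the scene, as well as on pedestrian intent. Existing stochastic predictors typically sample multiple futures from a single unimodal distribution, which yields sub-optimal 'mixed-mode' trajectories that lie between distinct motion patterns and are implausible in real scenes.

Result: Experiments on the PIE and JAAD datasets show that the method surpasses state-of-the-art baselines. The proposed MTP module is model-agnostic and can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction.

Insight: The contribution is a mode-aware framework that models the future trajectory distribution separately per crossing behavior (crossing/not crossing) and uses a query-based decoder to enforce mode consistency during decoding. It also introduces a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, improving frame-wise displacement errors.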

Abstract: Pedestrian trajectory prediction from an ego-centric camera is challenging since it depends on complex interactions with vehicles and scene context, as well as the intention of the pedestrian. Modelling correlation and intent from the pedestrian’s historical and future trajectories usually results in a multimodal (i.e., multiple-mode) distribution. Existing stochastic predictors often sample multiple futures from a single unimodal distribution, which can yield sub-optimal ‘mixed-mode’ trajectories that lie between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that separately models future trajectory distributions into semantically meaningful modes based on the pedestrian’s crossing behavior. MMPM consists of two modules: a behavior-aware Pedestrian Interaction Module (PIM) that jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head, and hand gestures, and a CVAE-based Mode-aware Trajectory Predictor (MTP) module that models the future trajectory distributions for two modes, crossing and not crossing the road, separately. A query-based decoder further enforces mode consistency during decoding. Experiments on the PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic and can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction performance. We additionally introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, demonstrating improved frame-wise displacement errors over previous work.


[41] Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework cs.CVPDF

Zhoupeng Guo, Yunqi Zhu, Zhihe Fan, Xinjie Yao, Ruipu Zhao

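TL;DR: This paper rethinks air-ground collaborative perception as a progressive cross-task collaboration problem and builds AGPC, a spatio-temporally aligned benchmark with more than 745K raw video frames. On top of it, the authors propose SCP, a coarse-to-fine socialized co-perception framework whose core module, the Dual-Layer Router, decouples input-side multi-scale expert selection from output-side task-conditioned modulation to enable selective cross-view and cross-task interaction.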

Details

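Motivation: Existing work typically treats air-ground collaborative perception as single-task cross-view fusion, overlooking the dependencies among localization, target association, and fine-grained parsing. In addition, geometric, scale, and occlusion discrepancies between heterogeneous views make uniform feature sharing prone to negative transfer.

Result: Extensive experiments on the proposed AGPC benchmark show that SCP achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance, indicating that task-conditioned collaboration is more effective than uniform fusion.

Insight: The main contributions are modelling air-ground collaborative perception as progressive cross-task collaboration and a socialized learning framework built around a Dual-Layer Router, which decouples expert selection from task modulation to suppress harmful interference and enable selective interaction. This offers a new approach to multi-task collaborative perception across heterogeneous views.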

Abstract: Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.


[42] From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models cs.CVPDF

Like Zhang, Runliang Niu, Shiqi Wang, Xiyu Hu, Qianli Xing

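TL;DR: This paper introduces ScreenAnnotator, an open-source data annotation tool for vision-language models that targets three bottlenecks of existing tools: limited expressiveness, annotation-training decoupling, and poor data reusability. Using a unified annotation atom schema, an on-policy annotation loop with an embedded Bayesian Annotation Verifier, and a template-driven multi-task data synthesis process, it efficiently produces complex visual reasoning data covering spatial coordinates, open-vocabulary descriptions, and topological relationships. On flowcharts and GUI screenshots, the tool markedly raises the annotation accept rate and reduces per-image annotation time; a VLM fine-tuned on its data reaches 76.1% average accuracy on flowchart tasks, an absolute gain of 35.1 percentage points.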

Details

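Motivation: Existing data annotation tools cannot meet the needs of training vision-language models for complex visual reasoning. They suffer from three systematic bottlenecks: limited expressiveness, severe decoupling of annotation from training, and poor data reusability, which calls for a new kind of annotation infrastructure.

Result: On flowcharts and GUI screenshots, the on-policy annotation loop drives the annotation accept rate to nearly 100% and 77%, respectively, and steadily reduces per-image annotation time as labeled data accumulate. A VLM fine-tuned on data generated by the tool reaches 76.1% average accuracy on flowchart tasks, an absolute gain of 35.1 percentage points over the baseline.

Insight: The contributions are a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit; an on-policy annotation loop with an embedded Bayesian verifier that tightly couples annotation with model training; and template-driven data synthesis that dynamically generates diverse tasks, improving data reusability and avoiding redundant re-annotation.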

Abstract: Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process that dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is an absolute gain of 35.1 percentage points. Our code is available at: https://github.com/WnQinm/Annotator.


[43] Automatic ply-specific analyses of CFRP micrographs using shortest-path-based ply distinction cs.CVPDF

Jonas Naumann, Jonas P. Appels, Julius Biermann, Christopher Gorsky, Timo de Wolff

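TL;DR: This paper proposes an automated, shortest-path-based method for distinguishing ply instances in semantic segmentation masks of high-resolution carbon-fiber reinforced polymer (CFRP) micrographs. It treats the segmentation mask as a graph with pixels as vertices and uses a shortest-path algorithm to generate ply-separating paths, using global information to bridge the gap between semantic segmentation and ply instance segmentation.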

Details

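Motivation: The problem is how to automatically and accurately identify and distinguish ply instances from semantic segmentation masks of high-resolution CFRP micrographs, so that the microstructure can be analysed quantitatively.

Result: The method was applied successfully to high-resolution micrographs with a range of characteristics (artificially added gaps in single or multiple plies, different stacking sequences, and ply-traversing cracks), enabling a comprehensive quantitative ply analysis of microstructural properties such as local fiber volume fraction, ply thickness, and interleaf layer thickness.

Insight: The contribution is casting the semantic segmentation mask as a graph and extracting global ply-boundary paths with a shortest-path algorithm, turning pixel-level classification into instance-level segmentation. This provides an automated tool for linking microstructural analysis to manufacturing processes and mechanical properties.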

Abstract: We present an automated approach to distinguish between ply instances in semantic segmentation masks of high-resolution carbon-fiber reinforced polymer micrographs. Interpreting the segmentation mask as a graph with pixels as vertices enables us to use a shortest-path algorithm that yields the ply-separating paths. Thereby, we bridge the gap between semantic segmentation and ply instance segmentation using global information. We successfully apply our approach to high-resolution micrographs featuring a broad range of characteristics, such as artificially added gaps in single or multiple plies, different stacking sequences, and ply-traversing cracks. Assigning each fiber pixel to a ply based on the calculated paths allows for a comprehensive, quantitative ply analysis with respect to microstructural properties such as the local fiber volume fraction and the locally resolved ply and interleaf layer thickness. These insights help to reveal manufacturing-induced inhomogeneities, draw conclusions on manufacturing parameters, and link mechanical properties to underlying microstructural imperfections.
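
The graph view of a segmentation mask can be sketched with Dijkstra's algorithm on a 4-connected pixel grid. The example below is a hedged illustration rather than the authors' pipeline: it assumes a binary mask (1 = fiber, 0 = matrix), hypothetical per-pixel costs that make fiber pixels expensive, and a separating path that runs from the left image border to the right one.

```python
import heapq


def separating_path(mask, fiber_cost=100.0, matrix_cost=1.0):
    """Return a minimum-cost left-to-right path as a list of (row, col).

    mask: 2D list where 1 marks a fiber pixel and 0 a matrix pixel.
    The cost of a path is the sum of the costs of the pixels it visits, so
    the path prefers to run through matrix-rich regions between plies.
    """
    rows, cols = len(mask), len(mask[0])

    def cost(r, c):
        return fiber_cost if mask[r][c] else matrix_cost

    dist, prev, heap = {}, {}, []
    # Every pixel in the left column is a possible start.
    for r in range(rows):
        dist[(r, 0)] = cost(r, 0)
        heapq.heappush(heap, (dist[(r, 0)], r, 0))

    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist.get((r, c), float("inf")):
            continue  # stale queue entry
        if c == cols - 1:
            # First right-column pixel popped is optimal; walk back to the start.
            path = [(r, c)]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        for dr, dc in ((-1, 0), (1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost(nr, nc)
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, nr, nc))
    return []
```

Because the path is globally optimal over the whole image, it can cross a local gap or crack where a purely local boundary detector would lose track of the interface.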


[44] Physics-IQ Verified cs.CVPDF

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini

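TL;DR: This paper presents a systematic audit of Physics-IQ, a benchmark for evaluating the physical understanding of video generative models, identifies its shortcomings, and proposes three improvements: higher-quality prompts and higher-quality ground-truth videos to reduce confounding factors, and a sample-level scoring system. The resulting benchmark, Physics-IQ Verified, refines more than half of the samples and improves over a third of the prompts; a comparison of six image-to-video generative models shows moderate but meaningful changes in model ranking.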

Details

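Motivation: Video generative models (VGMs) are used for many downstream tasks, including world modeling, so they need to understand the physical reality of the world. The existing Physics-IQ benchmark aims to quantify this understanding, but shortcomings in its evaluation method undermine the reliability of the measurement.

Result: In a comparison of six image-to-video generative models, the improved Physics-IQ Verified benchmark yields moderate but meaningful ranking changes (Kendall’s τ = 0.46).

Insight: The contribution is a systematic audit and improvement of a physical-understanding benchmark: better data quality and a fairer sample-level scoring scheme make the evaluation signal more reliable, giving a more rigorous methodology for measuring the physical common sense of video generative models.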

Abstract: Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall’s $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
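
For readers unfamiliar with the ranking statistic quoted above, the following sketch computes Kendall's τ (the tau-a variant, which assumes no ties) between two rankings of the same models. The function name and the example rankings are hypothetical and are not taken from the paper.

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties assumed).

    rank_a[i] and rank_b[i] are the ranks of item i under the two benchmarks.
    A value of 1 means identical order, -1 means fully reversed order.
    """
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    pairs = n * (n - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical ranks of six models under an original and a revised benchmark.
original = [1, 2, 3, 4, 5, 6]
revised = [2, 1, 3, 4, 6, 5]
tau = kendall_tau(original, revised)  # two swapped pairs out of fifteen
```

With six models there are 15 model pairs, so every pair whose order flips between the two benchmarks lowers τ by 2/15.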


[45] SP-TransientBench: A Real-Captured Single Photon Perception Benchmark cs.CVPDF

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng

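TL;DR: This paper presents SP-TransientBench (STB), a real-captured multi-task benchmark dataset for single-photon perception, aimed at the challenges that noise and complex transient phenomena pose for single-photon LiDAR (SPL) in real-world perception. The dataset contains 10,297 views across 10 diverse scenes, provides time-of-flight histograms, standardized metadata, and camera poses, and supports evaluation of depth estimation, multi-view reconstruction, and 3D semantic understanding.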

Details

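Motivation: Existing perception research based on single-photon avalanche diodes (SPADs) is largely limited to simulated data or small-scale controlled captures, and lacks systematic evaluation of real-world single-photon perception across multiple 3D vision tasks.

Result: The paper builds a real-captured dataset of 10 scenes and 10,297 views. Each view provides time-of-flight histograms at 256x192 resolution with multi-return behavior, 13-class 3D semantic annotations are included, and dedicated data splits and evaluation protocols are provided for each task.

Insight: The contribution is providing, for the first time, a large-scale, real-captured single-photon perception benchmark that covers multi-return transient phenomena and standardized metadata and supports systematic, reproducible benchmarking of depth estimation, multi-view reconstruction, and 3D semantic understanding.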

Abstract: Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at $256\times192$ resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.
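
To show how a time-of-flight histogram with multi-return behavior relates to depth, the sketch below converts local histogram peaks into distances using d = c·t/2. The bin width, the photon-count threshold, and the simple local-maximum peak rule are all assumptions for illustration; the benchmark's actual data format and processing are not described in this listing.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def histogram_returns(hist, bin_width_s, min_count):
    """Return the depths (in metres) of the local peaks in one pixel's histogram.

    hist: photon counts per time bin for a single pixel.
    bin_width_s: assumed duration of one time bin in seconds.
    min_count: assumed noise threshold; weaker peaks are ignored.
    """
    depths = []
    for i, count in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if count >= min_count and count > left and count >= right:
            t = (i + 0.5) * bin_width_s  # time at the bin centre
            depths.append(SPEED_OF_LIGHT * t / 2.0)  # halve for the round trip
    return depths
```

A pixel that sees both a semi-transparent surface and the object behind it would produce two peaks, and therefore two depths, which is the multi-return behavior a single depth map cannot represent.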


[46] Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos cs.CV | cs.ROPDF

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

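TL;DR: This paper proposes a latent-action-based framework that extracts general action priors from unlabeled human egocentric videos to train cross-embodiment vision-language-action (VLA) models. It uses a Hybrid Disentangled VQ-VAE that separates motion dynamics from environmental backgrounds through physical masks to build a cross-embodiment action codebook. Pre-training on human videos with this codebook teaches the VLM backbone deep representations of action intent, and an intent-perception decoupling strategy adapts the model to specific embodiments while reducing action hallucinations.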

Details

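Motivation: Training generalist VLA models requires large amounts of robot data with high-fidelity action annotations. Human egocentric manipulation videos are abundant and environmentally diverse, but they lack action labels and are therefore hard to use in conventional training paradigms.

Result: Experiments in simulation and real-world environments show that the method, pre-trained only on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on large-scale annotated datasets while needing only 50 trajectories for downstream adaptation.

Insight: The contributions are: 1) a Hybrid Disentangled VQ-VAE that separates motion from background through physical masks to build a cross-embodiment action codebook; 2) an intent-perception decoupling strategy in which the VLM predicts action intent while a frozen visual encoder supplies state features, reducing hallucinations; 3) efficient cross-embodiment adaptation after pre-training only on unlabeled human videos.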

Abstract: Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.


[47] Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation cs.CV | cs.ROPDF

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li

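TL;DR: This paper proposes Mem-World, a memory-augmented multi-view action-conditioned world model for robot manipulation. It introduces W-VMem, a wrist-view-centered, surfel-indexed 4D memory that anchors historical observations to temporally evolving surface elements, enabling geometry-aware retrieval of history frames in dynamic manipulation scenes and producing persistent, consistent action-conditioned video rollouts.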

Details

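Motivation: Action-conditioned world models for robot manipulation struggle with persistence: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, so models forget or hallucinate scene details seen in earlier frames, and existing memory retrieval strategies have difficulty identifying informative history frames in dynamic manipulation scenes.

Result: In complex manipulation scenarios, Mem-World generates persistent rollouts. Compared with Ctrl-World, it gives more reliable policy evaluation, improving the Pearson correlation with real-world performance by 14.5%. It also supports effective policy improvement through synthetic data generation, raising success rates on long-horizon tasks from 58% to 72%.

Insight: The contribution is W-VMem, a surfel-indexed 4D memory that explicitly models when and where scene elements are observed, enabling geometry-based retrieval of history frames conditioned on future actions. Anchoring historical observations to dynamic surface elements addresses the memory failures caused by occlusion and motion in manipulation scenes and gives the world model informative, non-redundant context.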

Abstract: Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.


[48] Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning cs.CV

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li

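TL;DR: This paper proposes Visual-OPSD, a cross-modal on-policy self-distillation method that targets the large inference overhead of visual-thought generation in unified multimodal models. The authors find that the generated pixels themselves contribute little to task accuracy, while the generation process encodes useful reasoning information. A teacher (with privileged visual thoughts) and a student (text only) share weights, and on-policy JSD distillation transfers the teacher's reasoning ability to the text-only student, yielding clear accuracy gains and a large speedup.

Details

Motivation: Unified multimodal models improve spatial-task performance by interleaving generated visual thoughts with text, but multi-step diffusion raises inference cost by an order of magnitude, and the authors find that this cost brings limited direct benefit. The aim is to exploit the semantic reasoning information encoded by the visual-thought generation process, beyond the rendered pixels, to train an efficient student that reasons in text only.

Result: Across nine benchmarks, Visual-OPSD improves over its generative teacher by 3.40 percentage points with a 14.3× speedup (10.0 s vs. 142.8 s per sample). On VSP it outperforms same-scale vision-language models by 63.83 percentage points. A Gaussian-noise control and a 58.4% closure of the KL gap confirm that the gains come from the semantic content of the generation pathway.

Insight: The core contribution is showing the semantic value of the visual-thought generation pathway in unified multimodal models, and proposing an on-policy self-distillation framework that, through a weight-sharing teacher-student setup and token-level JSD distillation, compresses knowledge from the expensive cross-modal generation process into efficient text-only reasoning. This suggests a route to high-accuracy, low-latency multimodal reasoning systems.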

Abstract: Unified multimodal models (UMMs) interleave generated “visual thoughts” (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude increase in inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model’s completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher’s reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
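The token-level JSD objective mentioned above can be illustrated with a small sketch. It computes the Jensen-Shannon divergence between teacher and student next-token distributions at each position and averages over positions. The function names, the equal-weight mixture, and the plain mean over positions are assumptions made for illustration; the abstract does not specify the paper's exact loss weighting.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with an equal-weight mixture (assumed)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def token_level_jsd(teacher_logits, student_logits):
    """Mean JSD over token positions; each argument is a list of logit rows."""
    if len(teacher_logits) != len(student_logits) or not teacher_logits:
        raise ValueError("expected two non-empty sequences of equal length")
    per_token = [
        jsd(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

# Hypothetical logits for a 2-token trajectory over a 3-word vocabulary.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.5, 0.7, -0.5], [0.0, 1.0, 2.0]]
loss = token_level_jsd(teacher, student)
```

Unlike forward KL, JSD is symmetric and bounded above by ln 2, which keeps the per-token loss finite even when teacher and student distributions barely overlap.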


[49] Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis cs.CV

Hong-Tao Yu, Chen-Wei Xie, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

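TL;DR: This paper introduces FG-BMK, a fine-grained image-task benchmark with 1.01 million questions and 0.28 million images for comprehensively evaluating large vision-language models (LVLMs) on fine-grained visual recognition and discrimination. It finds that current LVLMs still fall short on fine-grained tasks, and analyzes the causes of failure and directions for improvement.

Details

Motivation: Existing benchmarks mostly evaluate LVLMs from holistic or task-specific perspectives, while their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. A dedicated benchmark is therefore needed to diagnose model bottlenecks.

Result: Extensive experiments on a range of representative LVLMs/VLMs show that current models perform poorly on fine-grained recognition, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge.

Insight: The contribution is a large-scale, multi-scenario fine-grained benchmark, FG-BMK, with human-oriented and machine-oriented evaluation paradigms that can diagnose the specific causes of model failure (such as insufficient visual representations or weak semantic grounding), providing diagnostic insights and guidance for future data construction and model design.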

Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.


[50] DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval cs.CV

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

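TL;DR: This paper proposes DREAM, a Dual-path Representation Enhancement and Alignment Model for cross-modal retrieval. It strengthens text encoding by combining masked and permuted language modeling objectives, and designs a hierarchical vision encoder with cascaded group attention to integrate spatial and temporal information. It reports new state-of-the-art results on the MSRVTT, MSVD, and LSMDC benchmarks.

Details

Motivation: Existing video retrieval systems fall short in modeling fine-grained temporal dependencies and complex linguistic structure; with large-scale video content in particular, natural language queries need to be aligned more effectively with dynamic video semantics.

Result: On the MSRVTT, MSVD, and LSMDC benchmark datasets, DREAM reaches R1 scores of 49.4%, 49.7%, and 27.3%, respectively, setting a new state of the art.

Insight: The contributions are a dual-objective text modeling strategy (combining masked and permuted language modeling) that captures both local and global linguistic semantics, and a hierarchical vision encoder (cascaded group attention) that integrates spatio-temporal information from coarse to fine, offering a new architectural direction for cross-modal representation learning.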

Abstract: In today’s media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model’s ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.


[51] PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction cs.CV

João Cardeira, Diogo Glória-Silva, Manuel Letras da Luz, Rafael Ferreira, Diogo Tavares

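TL;DR: This paper introduces PorTEXTO, the first benchmark for visual text extraction in modern European Portuguese (pt-PT), intended to fill the gap left by existing OCR benchmarks for contemporary uses of the language. Data quality is ensured by combining transcriptions from a frontier large vision-language model with exhaustive review by native speakers. The study finds that current models degrade sharply from synthetic to real-world samples, and that specialized multilingual data improves pt-PT performance more than model size or resolution budget.

Details

Motivation: European Portuguese lacks OCR benchmarks for modern, culturally relevant applications; existing benchmarks concentrate on historical artifacts and literature and overlook contemporary needs.

Result: On PorTEXTO, most models show a sharp performance drop from synthetic to real-world samples; specialized multilingual data, rather than model size or resolution budget, is found to be the key driver of pt-PT performance.

Insight: The contribution is the first benchmark for contemporary European Portuguese visual text extraction, with an annotation pipeline of LVLM transcription plus native-speaker review to ensure quality. Viewed objectively, the analysis indicates that for low-resource languages, releasing specialized data resources is more effective than simply scaling up models, which points the OCR field toward a data-first strategy.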

Abstract: European Portuguese (pt-PT) is largely absent from OCR benchmarks, which skew toward high-resource languages. The few benchmarks that cover pt-PT focus on historical artifacts and literature. This work addresses modern OCR applications, introducing PorTEXTO, the first benchmark for contemporary and culturally relevant pt-PT visual text extraction. To ascertain quality, we employ an annotation pipeline combining transcriptions from a frontier LVLM with exhaustive review by native speakers. We observe a sharp performance drop from synthetic to real world samples in most models, and find that, currently, specialized multilingual data is a better driver for pt-PT performance than model size or resolution budget, motivating the release of open pt-PT OCR resources.


[52] Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework cs.CV

Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu

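TL;DR: This paper tackles complex human-object interaction (HOI) in image editing, proposing the HOI-Edit benchmark and the Self-Correcting Process Editing (SCPE) framework. First, the authors build HOI-Edit with three progressive cognitive levels and design an automated metric, HOI-Eval, which uses a vision-language model to evaluate instance-level interaction reliably through question answering. Second, they find that image-to-video (I2V) models are inherently suited to dynamic editing because of their temporal generation capability, and build SCPE on this observation: iteratively refined prompts constrain the I2V model's generation so that the video presents the target HOI more accurately, and frames extracted from the video are the final editing result.

Details

Motivation: Existing image editing methods handle static attributes well but fall short on complex human-object interactions. Existing benchmarks conflate HOI with static attributes and rely on global metrics that cannot simultaneously assess the validity of dynamic interactions and the preservation of entangled human-object pairs, leaving a critical challenge unaddressed.

Result: On the proposed HOI-Edit benchmark, SCPE performs on par with state-of-the-art (SOTA) editing models such as Nano Banana on interaction editing.

Insight: The main contributions are: 1) HOI-Edit, the first comprehensive benchmark focused on evaluating complex HOI editing, together with HOI-Eval, an automated metric based on vision-language-model question answering; 2) applying image-to-video (I2V) models to image editing, using their temporal generation capability to model dynamic relationships and providing a distinctive "replay of the failure process" for diagnosis; 3) SCPE, a novel agentic self-correcting framework that iteratively refines prompts to guide I2V generation and thereby improves interaction-editing accuracy.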

Abstract: Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI). This critical challenge is unaddressed by existing benchmarks, which conflate HOI with static attributes and rely on global metrics incapable of simultaneously assessing dynamic interaction validity and entangled human-object pair preservation. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels, which features an automated metric HOI-Eval that reliably evaluates instance-level interaction by having a VLM perform Q&A after thinking with images containing grounded Human-Object pairs. Considering the task’s essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a “replay of the failure process,” offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Extracted frames from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction. Code is available at https://github.com/oceanflowlab/HOI-Edit.


[53] AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model cs.CV

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre

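TL;DR: This paper introduces AMALIA-VL, the first open-source instruction-tuned large vision-language model built natively for European Portuguese (pt-PT). It combines a high-resolution vision encoder, dynamic image tiling, and a fully open pt-PT-optimized language model with a purpose-designed three-stage training process and a pt-PT-centric multimodal data mix, aiming to address how poorly existing open-source multimodal models serve European Portuguese.

Details

Motivation: Existing open-source multimodal models either conflate European Portuguese with Brazilian Portuguese or severely under-represent it in their training data, so the language is systematically neglected; a vision-language model built natively for European Portuguese is therefore needed.

Result: The evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. The authors will release model weights, training data, and construction pipelines, along with machine-translated pt-PT evaluation benchmarks.

Insight: The contributions include: pairing a fully open language model optimized for European Portuguese with a vision encoder; a three-stage training process covering vision-language alignment, general visual instruction tuning, and preference optimization; and a pt-PT-centric multimodal data mix that combines curated and translated public datasets with novel datasets to address the scarcity of multimodal resources for the language.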

Abstract: Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.


[54] ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL cs.CV | cs.AI

Mukund Khanna, Raj Singh Yadav, Kunal Singh

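TL;DR: This paper addresses the difficulty of preserving product identity (features, branding, textual elements) in instruction-based product image editing, and proposes the ProductConsistency dataset and method. It comprises an 87k-sample dataset for supervised fine-tuning (SFT), a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset. With a Cyclic Consistency reward guiding RL training, fine-tuning Qwen-Image-Edit-2511 and Flux.1-Kontext-dev markedly improves product consistency, text rendering, and overall visual quality.

Details

Motivation: Current instruction-based image editing models struggle in product-centric scenarios to preserve fine-grained object identity such as product features, branding, and textual elements, and datasets with text-fidelity constraints for this task are lacking.

Result: Evaluated on the proposed ProductConsistency Benchmark, the fine-tuned models outperform the baselines on OCR metrics, perceptual metrics, and MLLM-based evaluation; Qwen-Image-Edit-2511 achieves a 5x reduction in character error rate, indicating markedly better product consistency, text rendering, and visual quality.

Insight: The contribution is the first instruction-editing dataset focused on product identity preservation (ProductConsistency), plus a Cyclic Consistency reward for reinforcement learning that enforces semantic preservation by comparing the original product description with captions generated from the edited image, addressing product-feature and text-fidelity problems.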

Abstract: Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open and closed source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR metrics, perceptual metrics, and MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md
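The character error rate (CER) cited above is conventionally the Levenshtein edit distance between the OCR output and the reference text, divided by the reference length. The sketch below uses that standard definition; whether the paper normalizes or preprocesses text differently is not stated in the abstract, and the label strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a reference character
                cur[j - 1] + 1,          # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical product label text before and after an edit.
reference = "ACME COLA 330ml"
ocr_after_edit = "ACNE COLA 330ml"
score = cer(reference, ocr_after_edit)
```

Because CER is normalized by the reference length, it can exceed 1.0 when the edited image renders far more text than the original label contained.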


[55] Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos cs.CV

Jeongmin Bae, Seoha Kim, Marc Pollefeys, Mahdi Rad, Youngjung Uh

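TL;DR: This paper proposes Hand-4DGS, the first feed-forward framework that reconstructs dynamic 4D hands directly from egocentric videos, achieving fast inference and strong generalization. It combines a mesh-guided representation that introduces structural priors with temporal convolutions that model dynamic motion, and requires no expensive 3D hand pose ground-truth annotations.

Details

Motivation: Dynamic 3D hand reconstruction from single-view egocentric video is challenging because of fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity, yet it is essential for next-generation computing platforms such as AR/VR and AI glasses.

Result: Evaluated on two challenging egocentric datasets, H2O and ARCTIC, the method shows significant improvements over baselines.

Insight: The contribution is combining the generalization capability of feed-forward networks with effective 2D image supervision through Gaussian splatting, which avoids the need for expensive 3D annotations; the mesh-guided representation and temporal convolutions supply structural priors and temporal modeling for single-view dynamic hand reconstruction.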

Abstract: Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.


[56] Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory cs.CV | cs.LG

Kaustubh Kapil, Kishor P. Upla

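TL;DR: This paper introduces the Transformer Geometry Observatory (TGO), a framework for systematically studying the representational geometry and dynamics of Vision Transformers (ViTs). TGO-I, the first installment, analyzes the spectral geometry of ViT representations: using a ViT-Small/16 model trained on ImageNet-100, it tracks effective rank, stable rank, participation ratio, spectral entropy, spectral flatness, spectral anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Dimensional utilization increases consistently during training, anisotropy decreases, spectral entropy and participation ratio rise, and eigenspectra flatten progressively, indicating that variance is redistributed across representational dimensions rather than concentrated in a few dominant directions; the effect is most pronounced in the final CLS token representation.

Details

Motivation: Although ViTs are widely adopted and successful in computer vision, the fundamental understanding of their dimensional and representational geometry remains relatively limited. The paper aims to fill this gap by establishing the TGO framework for systematic study of ViT representational geometry and dynamics.

Result: Analysis of a ViT-Small/16 model trained on ImageNet-100 shows that dimensional utilization increases consistently during training, anisotropy decreases, spectral entropy and participation ratio rise, and eigenspectra flatten progressively. The final CLS token representation exhibits the highest effective dimensionality and lowest anisotropy in the network. These results provide quantitative evidence about how the geometry of ViT representations evolves.

Insight: The contribution is TGO, a systematic framework for studying ViT representational geometry; in particular, the spectral analysis reveals variance being redistributed across dimensions during training, which runs against the common intuition that training should concentrate information in a few dominant directions. This offers a new perspective and methodology for understanding the internal representations of Transformer models.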

Abstract: Despite the widespread adoption of Vision Transformers (ViTs) and their success across numerous computer vision applications, the fundamental understanding of their dimensional and representational geometry remains relatively underexplored. To address this gap, we introduce Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines designed to investigate the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, we analyze Effective Rank, Stable Rank, Participation Ratio, Spectral Entropy, Spectral Flatness, Spectral Anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Our results reveal a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a small number of dominant directions, we observe a progressive redistribution of variance across representational dimensions. This phenomenon is particularly pronounced in the final CLS token representation, which exhibits the highest effective dimensionality and lowest anisotropy within the network.
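Several of the spectral quantities listed above have widely used closed forms in terms of the eigenvalues of the representation covariance matrix. The sketch below uses those common textbook definitions; the paper's exact definitions (for example, whether it works from eigenvalues or singular values) are not given in the abstract, so treat these as assumptions. The eigenvalue lists are illustrative.

```python
import math

def normalize(eigs):
    """Turn non-negative eigenvalues into a probability distribution."""
    total = sum(eigs)
    if total <= 0 or any(e < 0 for e in eigs):
        raise ValueError("expected non-negative eigenvalues with positive sum")
    return [e / total for e in eigs]

def spectral_entropy(eigs):
    """Shannon entropy (nats) of the normalized eigenvalue distribution."""
    return -sum(p * math.log(p) for p in normalize(eigs) if p > 0)

def effective_rank(eigs):
    """Exponential of the spectral entropy."""
    return math.exp(spectral_entropy(eigs))

def participation_ratio(eigs):
    """(sum of eigenvalues)^2 divided by the sum of squared eigenvalues."""
    return sum(eigs) ** 2 / sum(e * e for e in eigs)

def stable_rank(eigs):
    """Sum of covariance eigenvalues over the largest one, i.e. the
    squared Frobenius norm over the squared spectral norm of the data."""
    return sum(eigs) / max(eigs)

flat = [1.0, 1.0, 1.0, 1.0]    # variance spread evenly over 4 directions
spiky = [1.0, 0.0, 0.0, 0.0]   # variance concentrated in 1 direction
```

For the flat spectrum all three rank-like measures equal 4, and for the spiky one they equal 1, so the flattening eigenspectra reported above correspond to these measures rising during training.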


[57] OneCanvas: 3D Scene Understanding via Panoramic Reprojection cs.CV | cs.AI | cs.LG | cs.RO

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

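TL;DR: OneCanvas proposes a new approach to 3D scene understanding via panoramic reprojection: patch features from multiple views are aggregated onto a single panoramic canvas, without complex geometry encoders or large training budgets. Each patch is unprojected to a 3D world coordinate and mapped to a longitude and latitude on the canvas, and a 3D position embedding is added to restore depth information, so a pretrained vision-language model can consume the representation as if it were an ordinary image.

Details

Motivation: Existing 3D scene understanding methods rely on complex geometry encoders or large training resources. OneCanvas aims to provide a simpler, more efficient alternative that unifies multi-view features through panoramic reprojection and supports the viewpoint-specific (situated) reasoning commonly required in robotics and embodied AI.

Result: OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, generalizes well to out-of-distribution data on SPBench, and uses an order of magnitude less training compute than the strongest competing methods.

Insight: The contributions include: a single panoramic canvas that places multi-view features in one spatial coordinate system without major modifications to the backbone; a spatial pretraining curriculum that procedurally places object patch features to generate on-the-fly supervision and reduce spatial reasoning shortcuts; and a representation that can be centered on any viewpoint, which directly supports situated reasoning.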

Abstract: Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch’s metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
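The reprojection step described above (unproject a patch with its depth and camera pose, then read off its longitude and latitude from the canvas origin) can be sketched as follows. The pinhole intrinsics, the camera-to-world pose convention, the axis convention (longitude measured around the y axis from +z), and a canvas aligned with the world axes are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera frame
    (pinhole model, assumed)."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

def cam_to_world(p_cam, rotation, translation):
    """Apply a camera-to-world pose given as a 3x3 rotation and a 3-vector."""
    return tuple(
        sum(rotation[i][k] * p_cam[k] for k in range(3)) + translation[i]
        for i in range(3)
    )

def canvas_coords(p_world, origin):
    """World point -> (longitude, latitude, range) as seen from the canvas
    origin, with the canvas axes aligned to the world axes (assumed)."""
    dx, dy, dz = (p_world[i] - origin[i] for i in range(3))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dx, dz)                       # in (-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, dy / r)))   # in [-pi/2, pi/2]
    return lon, lat, r

# Hypothetical patch at the image center, 2 m away, identity camera pose.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = unproject(320.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = cam_to_world(p_cam, identity, (0.0, 0.0, 0.0))
lon, lat, rng = canvas_coords(p_world, (0.0, 0.0, 0.0))
```

The range `rng` is discarded by the angular mapping, which is why the abstract adds a 3D position embedding of the metric coordinates to each patch feature.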


[58] CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems cs.CV | cs.RO

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

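TL;DR: CABLE is a cloud-assisted, bandwidth-efficient LMM-based encoding framework for Vehicle-to-Everything (V2X) systems. It uses the previous frame's cloud segmentation mask on the edge for motion compensation and region refinement, and uploads only region-of-interest (ROI) masked images, which substantially reduces communication overhead and cloud-side prefill latency.

Details

Motivation: Cloud-hosted large multimodal models in V2X systems incur heavy communication overhead and high cloud-side prefill latency when full-resolution frames are transmitted.

Result: Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings: ROI pixel coverage is reduced by 73% to 87%, estimated LMM prefill is 5 to 8 times faster, and detection quality drops only modestly.

Insight: The contribution is a mask-to-ROI-to-LMM feedback loop that uses motion compensation and a corridor envelope to generate robust ROIs dynamically, enabling efficient cooperative perception between edge and cloud.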

Abstract: Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving $73$–$87\%$ ROI pixel-coverage reduction with $5$–$8\times$ estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.


[59] A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2 cs.CV | cs.AI

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

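TL;DR: This paper presents a multi-domain benchmark for detecting text-rich images generated by GPT-Image-2, with 8,602 images across six categories such as commercial posters and infographics. It evaluates five representative detectors in a zero-shot setting, finds that their performance is highly domain-dependent and sensitive to JPEG compression, and explores the promise and limitations of a multimodal vision-language model on this task.

Details

Motivation: As multimodal image generation models become able to synthesize realistic textual content and structured visual designs, detecting AI-generated text-rich images matters for digital trust and content authenticity, yet existing benchmarks focus mainly on object-centric images and offer limited coverage of scenarios where textual semantics and layout organization are central.

Result: Five detectors were evaluated in a zero-shot setting. Performance is highly domain-dependent: methods that do well in some categories can fail in others, and the strongest conventional detector is severely sensitive to JPEG compression.

Insight: The contribution is the first multi-domain benchmark for text-rich images generated by GPT-Image-2, which exposes the need for text- and layout-aware detection and gives a preliminary look at the prospects and shortcomings of multimodal models on structured formats.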

Abstract: Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI’s GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.


[60] A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures cs.CV

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

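TL;DR: This paper proposes RS Adapter, a unified parameter-efficient fine-tuning framework for remote sensing visual question answering. It injects lightweight bottleneck adapters into three different vision-language model architectures (the dual-encoder CLIP, the encoder-decoder BLIP, and the hybrid FLAVA), enabling rapid domain adaptation with frozen backbones and less than 5% of parameters trainable.

Details

Motivation: Visual question answering in remote sensing faces distinctive challenges from high resolution, multi-scale object distribution, and semantic complexity, while applying general-domain foundation models directly to RSVQA runs into large domain shift and the prohibitive compute cost of full fine-tuning.

Result: Experiments on the high-resolution RSVQA-x dataset show that all adapted models converge, and that the hybrid FLAVA architecture is better than the unimodal architectures at maintaining multimodal reasoning and retrieval capabilities, establishing a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.

Insight: The contribution is a unified architectural-surgery pipeline that injects lightweight adapters into the attention and MLP layers of frozen models for parameter-efficient domain adaptation. Viewed objectively, the comparative analysis across VLM architectures offers useful guidance for model selection in remote sensing VQA, in particular the evidence that the hybrid architecture balances capability and efficiency well.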

Abstract: Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the Dual-Encoder CLIP, the Encoder-Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of parameters trainable. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
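A bottleneck adapter of the kind described above is usually a down-projection, a nonlinearity, an up-projection, and a residual connection around the frozen layer's output. The sketch below shows that generic pattern in plain Python; the ReLU nonlinearity, the zero-initialized up-projection, and the 768/64 dimensions are illustrative assumptions, not details confirmed by the abstract.

```python
def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def adapter_forward(h, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: h + W_up * relu(W_down * h + b_down) + b_up."""
    z = [max(0.0, a + b) for a, b in zip(matvec(w_down, h), b_down)]
    delta = [a + b for a, b in zip(matvec(w_up, z), b_up)]
    return [x + d for x, d in zip(h, delta)]  # residual connection

def adapter_param_count(d_model, bottleneck):
    """Trainable parameters of one adapter: two matrices plus two biases."""
    return 2 * d_model * bottleneck + bottleneck + d_model

# Toy example: hidden size 3, bottleneck size 2 (hypothetical values).
h = [1.0, -2.0, 3.0]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_down = [0.0, 0.0]
w_up_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # zero init: identity map
w_up = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_up = [0.0, 0.0, 0.0]

unchanged = adapter_forward(h, w_down, b_down, w_up_zero, b_up)
adapted = adapter_forward(h, w_down, b_down, w_up, b_up)
per_adapter = adapter_param_count(768, 64)
```

With a zero-initialized up-projection the adapter starts as an identity map, so inserting it leaves the frozen backbone's behavior unchanged until training begins; the parameter count grows linearly in the bottleneck width, which is what keeps the trainable fraction small.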


[61] Native Active Perception as Reasoning for Omni-Modal Understanding cs.CV | cs.CL | cs.SD

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma

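TL;DR: This paper proposes OmniAgent, a POMDP-based active perception agent that models long video understanding as an iterative Observation-Thought-Action cycle. It executes actions on demand (such as focusing on a specific segment) to selectively distill audio-visual cues into a persistent textual memory, decoupling reasoning complexity from raw video duration. The method is trained with agentic supervised fine-tuning and reinforcement learning with a turn-aware adaptive uncertainty rescaled advantage, and reaches state-of-the-art performance among open-source models on multiple benchmarks.

Details

Motivation: Passive long-video models (the "watch-it-all" paradigm) incur compute cost that grows with video duration, and existing interactive frameworks still need global pre-scanning, so their context cost still scales with video length.

Result: OmniAgent reaches state-of-the-art performance among open-source models on ten benchmarks (e.g., VideoMME, LVBench). On LVBench, its 7B model outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

Insight: The core contribution is formalizing video understanding as a POMDP-based active perception loop, which decouples reasoning complexity from video duration. Methodologically, it introduces agentic supervised fine-tuning, which bootstraps active perception via best-of-N trajectory synthesis with dual-stage quality control, and the TAURA reinforcement learning strategy, which uses turn-level entropy to steer credit assignment toward pivotal discovery turns. The model shows positive test-time scaling: performance improves as the number of reasoning turns increases, supporting the effectiveness of active perception.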

Abstract: Passive models for long video understanding typically rely on a “watch-it-all” paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5% vs. 47.3%).


[62] Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games cs.CV

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin

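TL;DR: This paper introduces RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite for evaluating multimodal large language models on multi-step interactive tasks where decisions must rely on past observations that are no longer visible. It includes two complementary games, Matching Pairs and 3D Maze, evaluated under a unified harness with three controlled difficulty axes (grid size, visual pattern, and observation modality), and adds a head-to-head duel protocol and a Memory Gap metric that separates forgetting from decision errors. Experiments show that frontier MLLMs are far from saturating the hard configurations, which require roughly 128K tokens and 350 image inputs, and that most errors stem from forgetting rather than decision making. Fine-tuning Qwen3.5-9B improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

Details

Motivation: Deploying multimodal foundation models as closed-loop policies increasingly requires acting on past observations that are no longer visible. Existing benchmarks, however, either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended, so they cannot properly evaluate reconstruction of past observations and decision making during interaction.

Result: On RNG-Bench, frontier MLLMs are far from saturating the hardest configurations (roughly 128K tokens and 350 image inputs per episode). Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

Insight: The core contribution is RNG-Bench, a benchmark designed to isolate a model's ability to reconstruct past observations during interaction and act on them; its controlled difficulty axes, head-to-head duel protocol, and Memory Gap metric give a finer-grained evaluation framework. Viewed objectively, the work underlines the importance of separating memory from decision making when evaluating multimodal agents, and gives future models a clear evaluation target and data support for sequential decision tasks that need long-term memory.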

Abstract: Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model’s ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.


cs.AI

[63] ForecastBench-Sim: A Simulated-World Forecasting Benchmark cs.AI | cs.CL | cs.LG

Jaeho Lee, Nick Merrill, Ezra Karger

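TL;DR: ForecastBench-Sim is a forecasting benchmark built on Freeciv game simulations. It evaluates the probabilistic reasoning of general-purpose AI systems through controlled, immediately resolvable simulated tasks: it generates questions about hidden future states from game-state snapshots, then continues the simulation automatically to score the forecasts.

Details

Motivation: Real-world forecasting benchmarks are constrained by slow outcome resolution, rare tail events, and counterfactual questions that are hard to score. The benchmark provides a controlled environment for studying probabilistic reasoning under dynamic world states.

Result: The paper describes the benchmark pipeline, question families, scoring protocol, and release artifacts, and reports validation slices from model evaluations and an anonymized human pilot, which indicate that the pipeline can generate diverse forecasting tasks.

Insight: The novelty lies in using a strategy-game simulation to generate forecasting tasks at arbitrary time horizons, conditional or causal questions, and rare events, giving a scalable and repeatable complement to real-world benchmarks of AI forecasting ability.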

Abstract: Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.
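The scoring step described above can be illustrated with a small sketch. The abstract does not say which scoring rule ForecastBench-Sim uses, so the Brier score below is an assumption chosen only because it is a familiar proper scoring rule for binary questions; the function name and the sample forecasts are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and resolved 0/1 outcomes.

    Assumption: the benchmark's real scoring protocol is not given in the abstract;
    the Brier score is used here only as a familiar proper scoring rule.
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical binary questions, resolved by continuing the simulation.
forecasts = [0.9, 0.2, 0.5]
outcomes = [1, 0, 1]
score = brier_score(forecasts, outcomes)  # lower is better; 0.0 is a perfect forecast
```

Because the world is simulated, the `outcomes` list would be available as soon as the rollout continues, which is what makes this kind of scoring immediate.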


[64] Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents cs.AI | cs.CL | cs.IR | cs.MAPDF

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

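TL;DR: This paper proposes Decoupled Search Grounding (DSG), a vendor-agnostic architecture that separates real-time search grounding from the reasoning model in LLM agents. Through an MCP-compatible gateway, DSG exposes provider routing, context rendering, caching, and related settings as first-class controls, addressing the way native search grounding couples many concerns inside a single model-provider boundary.

Details

Motivation: Production LLM agents depend on real-time search, but native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This makes grounding hard to inspect, tune, reuse, or port, and it can trigger Search-Induced Verbosity that breaks strict output contracts.

Result: Experiments cover five frontier models on SimpleQA, FreshQA, and HotpotQA. Native search leads on recency-sensitive FreshQA, but DSG is stronger where control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. On an e-commerce query-understanding (QIU) workload, DSG matches or slightly exceeds native-search accuracy while cutting search cost by over 98%.

Insight: The key idea is to treat real-time search grounding as an optimizable interface boundary rather than a fixed model feature. The decoupled architecture yields vendor independence, controllability, and cost efficiency, and provides a reusable, portable grounding layer for large-scale agentic workloads. Its design choices, such as exposed controls and caching, are useful references for building modular and efficient LLM agent systems.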

Abstract: Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.
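The "exact plus semantic caching" control can be pictured with a toy two-tier cache. This is a minimal sketch under stated assumptions, not DSG's implementation: the class name `GroundingCache`, the bag-of-words embedding, and the 0.8 similarity threshold are all hypothetical stand-ins for whatever embedding model and policy a real gateway would use.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy embedding (assumption): word counts stand in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class GroundingCache:
    """Hypothetical two-tier cache: exact match first, then nearest-neighbour match."""

    def __init__(self, embed=bag_of_words, threshold=0.8):
        self.embed = embed
        self.threshold = threshold
        self.exact = {}    # normalized query -> cached result
        self.entries = []  # (embedding, cached result)

    @staticmethod
    def _key(query):
        return " ".join(query.lower().split())

    def put(self, query, result):
        self.exact[self._key(query)] = result
        self.entries.append((self.embed(query), result))

    def get(self, query):
        key = self._key(query)
        if key in self.exact:
            return "exact", self.exact[key]
        q = self.embed(query)
        best_score, best = 0.0, None
        for emb, result in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best = score, result
        if best is not None and best_score >= self.threshold:
            return "semantic", best
        return "miss", None
```

A caller would only issue a paid search request on a `"miss"`, which is one plausible way a warm cache could reduce both cost and latency.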


[65] Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation cs.AI | cs.CLPDF

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

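TL;DR: The paper proposes Rubric-Conditioned Self-Distillation, a framework for improving post-training of reasoning language models. It uses rubrics as structured, fine-grained feedback: a rubric-conditioned teacher provides token-level guidance on the student's own sampled reasoning trajectories, which removes the dependence on a single reference rationale.

Details

Motivation: Existing methods have limits. Supervised distillation relies on chain-of-thought annotations that are expensive and possibly imperfect, while reinforcement learning with verifiable rewards compresses evaluative feedback into a scalar signal, obscuring which aspects of a response need improvement.

Result: Across multiple science reasoning benchmarks, the method surpasses GRPO by 1.0 points and OPSD by 0.9 points on average, indicating that it effectively converts rubric-level criteria into token-level guidance over the reasoning process.

Insight: The core novelty is to introduce rubrics as a structured feedback mechanism and to build a self-distillation framework around a rubric-conditioned teacher. This enables finer-grained credit assignment than scalar reward optimization and avoids relying on a single, possibly imperfect reference as supervision.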

Abstract: Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student’s own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks, and the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.
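To make "token-level guidance" concrete, here is a minimal sketch of a per-token distillation signal. The abstract does not state which divergence or loss the authors use, so the forward KL from teacher to student is an assumption, as are the function names; the teacher logits are imagined to come from the same model scored with the rubric in its context.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_level_kl(student_logits, teacher_logits):
    """KL(teacher || student) at each position of a student-sampled trajectory.

    Both arguments are lists of per-position logit vectors over the vocabulary.
    Assumption: the divergence choice is illustrative, not taken from the paper.
    """
    losses = []
    for s, t in zip(student_logits, teacher_logits):
        p_s, p_t = softmax(s), softmax(t)
        losses.append(sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0))
    return losses
```

Positions where the rubric-conditioned teacher disagrees most with the student receive the largest values, which is the sense in which the signal is denser than one scalar reward per response.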


cs.RO [Back]

[66] SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation cs.RO | cs.CVPDF

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi

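TL;DR: The paper proposes SC3-Eval, a method for evaluating robot foundation models through self-consistent video generation. By enforcing forward-inverse dynamics consistency, cross-view consistency, and test-time consistency, it turns a pre-trained video foundation model into an accurate policy evaluator, addressing compounding autoregressive errors, multi-view inconsistency, and out-of-distribution policy behavior in simulated rollouts.

Details

Motivation: Evaluating generalist robot manipulation policies in the real world is expensive, slow, and hard to scale. Action-conditioned video world models are a scalable alternative, but they face compounding autoregressive errors, inconsistent multi-view observations, and the need to generalize to policy behaviors outside the training distribution.

Result: Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and an MMRV of 0.119, outperforming three strong prior video-model-based baselines, and it generalizes to new tasks.

Insight: The novelty is three complementary consistency constraints that improve the evaluation accuracy of a video generation model: forward-inverse dynamics consistency anchors rollouts to a physically plausible action manifold, cross-view consistency keeps multi-camera observations coherent over long rollouts, and test-time consistency uses inverse dynamics as an uncertainty signal to terminate drifting rollouts. The method also reproduces the failure modes policies show in real deployment, supporting fine-grained diagnostic comparison.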

Abstract: Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. Autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of $0.929$ and MMRV of $0.119$, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
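The headline metric, a Pearson correlation between evaluator scores and real-world scores, can be computed as sketched below. The per-policy success rates are invented purely for illustration and are not results from the paper; MMRV is left out because its definition is not given in the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical success rates for a handful of policies (illustrative numbers only).
real_world = [0.20, 0.45, 0.50, 0.70, 0.90]
video_model = [0.25, 0.40, 0.55, 0.65, 0.85]
r = pearson(real_world, video_model)
```

A value near 1 means the video-model evaluator preserves the relative standing of policies closely enough to substitute for some real-world trials.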


[67] Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots cs.RO | cs.CVPDF

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada

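TL;DR: This paper systematically evaluates multimodal SLAM on quadruped robots, focusing on how sensor configuration affects localization accuracy and robustness. Using the GrandTour dataset recorded on an ANYmal D quadruped, it compares visual, visual-inertial, and LiDAR-visual-inertial SLAM methods under different hardware configurations. Stereo configurations outperform monocular and RGB-D cameras, global shutter cameras reduce motion-induced tracking failures more than rolling shutter cameras, and standard inertial integration can degrade vision-dominant frameworks under harsh legged locomotion.

Details

Motivation: Dynamic legged locomotion introduces distinctive sensing challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. A systematic evaluation of how hardware-level sensor configurations affect quadruped SLAM performance has been missing, and this paper aims to fill that gap.

Result: Experiments on the GrandTour dataset show that stereo configurations consistently outperform monocular and RGB-D modalities in localization accuracy, that global shutter cameras significantly reduce motion-induced tracking failures compared with rolling shutter cameras, and that standard inertial integration can degrade primarily vision-based SLAM frameworks under harsh legged locomotion. The results provide a trade-off analysis across localization accuracy, algorithmic robustness, and computational resource utilization.

Insight: The paper systematically quantifies how sensor configuration (camera modality, shutter technique, and inertial sensor tier) affects quadruped SLAM performance, and it gives concrete design guidelines for tailoring sensor payloads to agile legged systems. More broadly, it shows that hardware selection strongly shapes SLAM robustness under dynamic legged motion, giving empirical grounding for quadruped perception system design.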

Abstract: Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.


[68] Do as I Do: Dexterous Manipulation Data from Everyday Human Videos cs.RO | cs.CVPDF

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah

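TL;DR: The paper presents DO AS I DO, an algorithm that reconstructs and retargets hand-object interactions from monocular RGB human videos to produce manipulation data for multi-fingered dexterous robots. It handles in-the-wild videos from various viewpoints and converts them into robot action sequences executable in the real world.

Details

Motivation: The goal is scalable generation of robot manipulation data, especially for human-like platforms such as dexterous multi-fingered hands, while overcoming two obstacles: estimating hand-object interaction from abundant but RGB-only human videos, and crossing the human-to-robot embodiment gap.

Result: DO AS I DO outperforms the previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, in experiments on datasets with ground truth and on a dataset of video clips collected online.

Insight: The novelty is a complete reconstruction-and-retargeting pipeline from monocular human video to robot-executable actions. It offers practitioners a practical playbook for collecting human manipulation data and bridges the human-to-robot embodiment gap.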

Abstract: How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.


cs.SD [Back]

[69] Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors cs.SD | cs.AI | cs.CVPDF

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring

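TL;DR: The paper proposes ScenA, which conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, on multiple reference voices and a free-form natural language prompt to generate multi-speaker audio scenes. It avoids the structured supervision that conventional dialogue systems depend on and produces realistic conversational audio with background noise, room acoustics, overlapping dialogue, and paralinguistic events.

Details

Motivation: Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision (per-turn tags, multi-stream transcriptions, or learnable speaker embeddings) and generate only clean speech sequences that lack the ambient texture of real conversations. This work aims to use the capacity of a foundation model to generate natural, non-studio multi-speaker audio scenes directly from reference voices and a natural language prompt.

Result: On the CoVoMix2-Dialogue benchmark, ScenA outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound.

Insight: The contributions include inheriting natural audio generation from a pretrained foundation model, achieving multi-speaker control through reference latents and lightweight identity-aware positional encodings, and introducing a high-noise-biased timestep distribution to counter the "Reference Shortcut". That distribution forces the model to rely on the text prompt for speaker assignment rather than on acoustic similarity alone.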

Abstract: Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundation model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model’s token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.


cs.CY [Back]

[70] The Market in the Model: Latent Diffusion as Neural Economy cs.CY | cs.CVPDF

Eryk Salvaggio

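TL;DR: The paper critically analyzes the mechanisms of the latent diffusion model, treating it as a "neural economy", and examines how the model abstracts social communication into commensurable vectors and turns the social sphere into parcels for sale. It aims to expand rather than replace dataset critique by tracing the histories and the theory of vision behind each component, showing how the model entrenches the logics of platform and attention economies.

Details

Motivation: Existing critique of generative image models focuses mostly on the influence of datasets and neglects the ideological positions embedded in the model's mechanisms, leaving models imagined as "black boxes". By analyzing the components of the latent diffusion model and the decisions they automate, the paper shows how the model converts social communication into commodified vectors, widening the scope of critique.

Result: The paper reports no quantitative experiments or benchmarks. Through qualitative analysis, it traces the training and generation pipelines component by component, showing what each operation displaces in social communication and how it reinforces platform-economy logic.

Insight: The novelty is the concept of a "neural economy", which reads the latent diffusion model as a contained symbolic system that abstracts social communication into vectors for sale. The study offers a critical framework for the social impact of generative models and argues for centering social exchange rather than fixating on copyright and commodity defenses, which risks reaffirming the fetishism the model produces.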

Abstract: Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as “black boxes.” In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert’s notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.


cs.LG [Back]

[71] Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier cs.LG | cs.AI | cs.CLPDF

Lorenz Wolf, Connor Watts, Roger Creus Castanyer, Geoffrey Bradway, Maxwill Lin

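TL;DR: The paper introduces PROPEL, a framework that addresses the shortage of frontier tasks in reinforcement learning training. It trains a lightweight activation probe to predict the target solver's pass rate, so that a task generator can be optimized toward the learnable frontier (solvable tasks of suitable difficulty) without repeated solver evaluation.

Details

Motivation: As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation tends to yield tasks that are trivial, impossible, or ill-posed. Optimizing a task generator directly with reinforcement learning for validity and learnability is very expensive because every candidate task needs repeated solver rollouts, especially for slow tasks such as software engineering.

Result: Across math, code, and software engineering at multiple model scales, PROPEL shifts generated tasks toward the targeted solve rate. For coding, the share of tasks at the learnable frontier rises from 10.1% to 20.0% for a Qwen2.5-3B-Instruct solver and from 5.3% to 12.6% for Qwen2.5-7B-Instruct. For software engineering on unseen repositories, the share of tasks at the targeted solve rate rises from 9.8% to 19.6% for a Qwen3.5-27B solver.

Insight: The core idea is a solver-amortized framework: an activation probe is trained once on a labeled corpus to predict solver performance, replacing expensive online solver evaluation with a single forward pass and sharply cutting the cost of generator optimization. This gives a scalable way to generate tasks at the edge of a model's current ability for continued learning.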

Abstract: The limiting resource for training agents via reinforcement learning (RL) is increasingly frontier task supply: valid, solvable tasks just difficult enough to train the current model. As reasoning and agentic models improve, fixed task distributions saturate, while naive synthetic generation yields tasks that are trivial, impossible, or ill-posed. Training a task generator with RL to optimize validity and learnability can address this bottleneck, but direct optimization requires repeated solver rollouts per candidate. For software-engineering (SWE) tasks, a single rollout can take tens of minutes; solver-in-the-loop generator training is intractable. We introduce PROPEL, a solver-amortized framework for training task generators at the targeted solve rate. PROPEL trains a lightweight activation probe on a one-time labeled corpus of generated tasks and solver outcomes. The probe predicts target-solver pass rate from a frozen generator reference model and serves as a proxy for solve rate during generator optimization, reducing generator evaluation to a single forward pass. Across math, code, and software-engineering at multiple model scales, PROPEL shifts generation toward the targeted solve rate: for coding, tasks generated at the learnable frontier increase from $10.1\% \rightarrow 20.0\%$ for a Qwen2.5-3B-Instruct solver and from $5.3\% \rightarrow 12.6\%$ for a Qwen2.5-7B-Instruct solver. For SWE, PROPEL increases the share of generations at the targeted solve rate from $9.8\% \rightarrow 19.6\%$ for Qwen3.5-27B on repositories not seen during training of the probe and generator.
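The probe idea can be sketched as a small logistic regression from activation features to an empirical pass rate. This is a minimal sketch under stated assumptions: the abstract does not describe the probe's architecture or loss, the one-dimensional "activations" are toy numbers, and `train_probe` and `predict` are hypothetical names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(features, pass_rates, lr=0.5, steps=2000):
    """Fit a linear probe with cross-entropy against soft pass-rate labels."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, pass_rates):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Predicted solver pass rate for one task, from a single cheap evaluation."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy stand-ins for frozen-model activations and one-time measured pass rates.
features = [[0.0], [1.0], [2.0], [3.0]]
pass_rates = [0.1, 0.3, 0.7, 0.9]
w, b = train_probe(features, pass_rates)
predictions = [predict(w, b, x) for x in features]
```

Once trained, a generator reward could be built from how close `predict` lands to the targeted solve rate, avoiding a solver rollout for every candidate task.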


[72] Fair Cognitive Impairment Detection Through Unlearning cs.LG | cs.CL | cs.SD | eess.ASPDF

William Nguyen, Jiali Cheng, Hadi Amiri

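TL;DR: The paper proposes a multimodal framework for detecting mild cognitive impairment (MCI) that combines cross-modal fusion with unlearning based on gradient reversal. The method aims to improve fairness and robustness by reducing the model's reliance on task-irrelevant demographic attributes such as sex and language.

Details

Motivation: Detecting MCI from spontaneous speech is promising for scalable screening, but existing models often exploit demographic cues correlated with the labels, leading to large performance gaps across subgroups such as sex and language.

Result: On the multilingual benchmarks TAUKADIAL and PREPARE, the method outperforms the state-of-the-art multilingual multimodal baseline in MCI classification while substantially narrowing the performance gap across patient subgroups (sex and language).

Insight: The core novelty is integrating gradient-reversal unlearning into a multimodal framework so that the shared embedding is discouraged from encoding task-irrelevant demographic attributes, yielding fairer and more robust representations for MCI detection. Building the fairness constraint directly into representation learning is a practical route for addressing bias in medical AI.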

Abstract: Mild Cognitive Impairment (MCI) is a medical condition characterized by a noticeable decline in memory, language, or thinking abilities. MCI detection from spontaneous speech is promising for scalable screening. However, learned models often exploit demographic cues correlated with labels, resulting in a large performance gap across subgroups. We present a multimodal framework that combines (i) cross-modal fusion between modalities (speech, text, and image), and (ii) unlearning using gradient reversal that discourages the shared embedding from encoding task-irrelevant demographic attributes. Evaluated on the multilingual benchmarks TAUKADIAL and PREPARE, our method outperforms the state-of-the-art multilingual and multimodal baseline in MCI classification while substantially reducing the performance gap across patient subgroups (sex and language). We further analyze transfer across datasets, showing that demographic unlearning helps learn more robust representations for MCI detection.
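Gradient reversal, the unlearning mechanism named above, is easy to state without an autodiff library: the layer is the identity in the forward pass and flips the sign of the gradient in the backward pass. The sketch below writes both passes by hand; the class name and the scaling factor `lam` are assumptions, and how the authors wire the layer into their model is not described in the abstract.

```python
class GradientReversal:
    """Identity on the forward pass; multiplies incoming gradients by -lam on the way back.

    Placed between a shared embedding and a demographic classifier, it makes the
    encoder ascend the classifier's loss while the classifier itself still descends it.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical strength of the reversal

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=2.0)
embedding = [0.3, -1.2]
passed_through = layer.forward(embedding)      # unchanged embedding goes to the classifier
encoder_grad = layer.backward([0.5, -1.0])     # the encoder receives the reversed gradient
```

The effect is that features useful for predicting sex or language are penalized in the shared embedding, while the main MCI classifier is trained as usual.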


[73] REVES: REvision and VErification–Augmented Training for Test-Time Scaling cs.LG | cs.CLPDF

Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin

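TL;DR: The paper proposes REVES, a framework that alternates between online data/prompt augmentation and policy optimization. It converts intermediate wrong answers from multi-step reasoning into decoupled revision and verification prompts to strengthen the test-time scaling ability of large language models, and it clearly outperforms baselines on tasks such as code generation and mathematical reasoning.

Details

Motivation: Standard post-training methods mainly optimize single-shot objectives, which is fundamentally misaligned with the dynamics of multi-step inference. Existing multi-turn reinforcement learning methods optimize whole trajectories directly and fail to exploit what the model could learn from correcting high-quality intermediate mistakes.

Result: On LiveCodeBench, using publicly available test cases as feedback, the method gains 6.5 points over the RL baseline and 4.0 points over standard multi-turn training. On circle packing, it matches the previously reported SOTA result with the smallest base model (4B) and fewer rollouts. Math results under ground-truth verification also confirm improved correction ability.

Insight: The core novelty is decoupling the intermediate steps ("near-miss" answers) of successful recovery trajectories into revision and verification prompts, which concentrates training on both answer transformation and error identification. This enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling.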

Abstract: Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps (“near-miss” answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.


[74] STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability cs.LG | cs.AI | cs.CLPDF

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng

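TL;DR: The paper targets the policy entropy collapse commonly seen when GRPO and other reinforcement learning algorithms with verifiable rewards are used for LLM post-training. A first-order gradient analysis reveals a token-level credit assignment mismatch, and the paper proposes STARE in response. With surprisal-based token-level advantage reweighting and a target-entropy closed-loop gate, STARE achieves stable training and controlled policy entropy across model scales from 1.5B to 32B and several reasoning task families.

Details

Motivation: The goal is to resolve the policy entropy collapse that arises when GRPO-style algorithms are used for post-training LLMs on complex reasoning. The analysis of token-level entropy dynamics traces the problem to a credit assignment mismatch.

Result: On AIME24 and AIME25, STARE outperforms DAPO and other baselines by 4%-8% in average accuracy and keeps policy entropy within the target band over thousands of training steps. Reflection tokens and response length grow in tandem, indicating a sustained exploration-exploitation balance.

Insight: The novelty is a gradient analysis of token-level entropy dynamics that reveals an advantage-surprisal four-quadrant structure and a near-criticality property. Based on it, the method identifies entropy-critical token subsets via surprisal quantiles, selectively reweights their effective advantages, and adds a target-entropy closed-loop gate for stable entropy regulation.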

Abstract: Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
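The quantile-based selection step can be sketched as follows. This is only an illustration of "pick tokens by batch surprisal quantile, then rescale their advantage": which quadrant is rescaled, in which direction, and by how much are not stated in the abstract, so the quantile `q`, the factor `scale`, and the choice to damp high-surprisal tokens are all assumptions, and the closed-loop entropy gate is omitted.

```python
import math


def surprisal_quantile(surprisals, q):
    """Empirical q-quantile of the batch surprisals (simple index-based estimate)."""
    ordered = sorted(surprisals)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def reweight_advantages(token_probs, advantage, q=0.8, scale=0.5):
    """Share one trajectory-level advantage across tokens, rescaling the
    high-surprisal ones. All numeric choices here are illustrative assumptions."""
    surprisals = [-math.log(p) for p in token_probs]
    cutoff = surprisal_quantile(surprisals, q)
    return [advantage * (scale if s >= cutoff else 1.0) for s in surprisals]


# Probabilities the policy assigned to the tokens it actually sampled.
token_probs = [0.9, 0.8, 0.7, 0.6, 0.01]
weighted = reweight_advantages(token_probs, advantage=1.0)
```

In a GRPO-style update every token would otherwise inherit the same advantage, so a rule of this kind changes only how strongly the rare, high-surprisal tokens are pushed.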


[75] Structured Inference with Large Language Gibbs cs.LG | cs.CLPDF

Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

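TL;DR: The paper proposes Large Language Gibbs, a scheme for structured probabilistic inference that uses the conditional distributions of a large language model (LLM) as transition operators in Markov chain Monte Carlo (MCMC). Instead of single-pass autoregressive generation, it iteratively resamples individual variables conditioned on the others, which avoids order-dependent biases and yields a stationary distribution that balances all local conditionals. Applied to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning, it appears to be a practical alternative for structured inference under a world prior accessible through noisy LLM conditionals.

Details

Motivation: The question is how to access the knowledge encoded in LLMs in a probabilistically coherent way so that it can support structured reasoning over variables describing a complex world, which is itself a difficult inference problem.

Result: The method is evaluated on sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that using LLM conditionals in MCMC is a practical alternative for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.

Insight: The novelty is using LLM conditional distributions as the transition operators of a Gibbs sampler and performing structured inference by iterative resampling. This avoids the order bias of autoregressive generation and strikes a compromise among local conditionals, offering a new framework for probabilistically coherent, non-autoregressive structured reasoning with LLMs.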

Abstract: The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM’s next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.
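The resampling loop described above is ordinary Gibbs sampling with the conditionals supplied by a model. In the sketch below the two conditionals are hard-coded toy functions standing in for an LLM's next-token probabilities; that substitution, the variable names, and the sweep count are assumptions made so the example runs without a model.

```python
import random


def gibbs(conditionals, state, sweeps, rng):
    """Resample each variable in turn from P(variable = 1 | all the others)."""
    samples = []
    for _ in range(sweeps):
        for name, p_one_given_rest in conditionals.items():
            state[name] = 1 if rng.random() < p_one_given_rest(state) else 0
        samples.append(dict(state))
    return samples


# Toy stand-ins for LLM conditionals: each variable tends to agree with the other.
conditionals = {
    "a": lambda s: 0.8 if s["b"] == 1 else 0.2,
    "b": lambda s: 0.8 if s["a"] == 1 else 0.2,
}

samples = gibbs(conditionals, {"a": 0, "b": 0}, sweeps=20000, rng=random.Random(0))
agreement = sum(s["a"] == s["b"] for s in samples) / len(samples)
```

These toy conditionals are mutually consistent, so the chain targets a joint in which the two variables agree with probability 0.8. Real LLM conditionals are generally not mutually consistent, which is why the abstract speaks of a stationary distribution that reflects a compromise between them.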


[76] Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs cs.LG | cs.CVPDF

Guillermo Rojas, Gonzalo Soto, Daniel Yunge

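TL;DR: The paper proposes hybrid models that combine spiking neural networks (SNNs) with convolutional neural network (CNN) components to process simulated event-camera (DVS) data converted from smartphone videos, aimed mainly at human fall detection. The approach exploits the energy efficiency and spatio-temporal processing of SNNs. Simulated evaluation on multiple datasets shows significant efficiency gains while maintaining accuracy.

Details

Motivation: Traditional methods are energy-inefficient for complex tasks such as fall detection in real-world environments. The work aims to exploit the efficiency of event cameras and spiking neural networks.

Result: Simulations on multiple datasets show that the proposed hybrid models achieve significant efficiency gains over traditional machine learning models while maintaining accuracy.

Insight: The novelty is combining SNNs with CNN components and training on DVS data synthesized from conventional video, which suggests a route to deploying efficient spatio-temporal event-processing models on resource-constrained devices.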

Abstract: This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS) generated from conventional smartphone videos. Aimed primarily at human fall detection, the approach leverages the energy efficiency and spatio-temporal processing capabilities of SNNs by converting video frames into event-based data. The proposed models are evaluated through simulations on multiple datasets, comparing their performance to that of traditional machine learning models. Results demonstrate significant gains in efficiency without sacrificing accuracy, underscoring the potential of combining SNNs and DVS technology for complex tasks in real-world environments.
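Converting video frames into event-based data can be sketched with the usual idealized DVS model, in which a pixel fires when its log intensity changes by more than a contrast threshold. The abstract does not say which converter the authors use, so this model, the threshold of 0.2, and the function name are assumptions.

```python
import math


def frames_to_events(prev_frame, next_frame, threshold=0.2, eps=1e-3):
    """Emit (x, y, polarity) events where log intensity changed by >= threshold.

    Frames are 2-D lists of intensities in [0, 1]; eps avoids log(0).
    Idealized model: at most one event per pixel per frame pair, no sensor noise.
    """
    events = []
    for y, (row_prev, row_next) in enumerate(zip(prev_frame, next_frame)):
        for x, (a, b) in enumerate(zip(row_prev, row_next)):
            delta = math.log(b + eps) - math.log(a + eps)
            if delta >= threshold:
                events.append((x, y, 1))    # pixel got brighter
            elif delta <= -threshold:
                events.append((x, y, -1))   # pixel got darker
    return events


prev_frame = [[0.5, 0.5], [0.5, 0.5]]
next_frame = [[0.5, 1.0], [0.1, 0.5]]
events = frames_to_events(prev_frame, next_frame)
```

Static pixels produce no events at all, which is the sparsity that a spiking network can exploit for efficiency.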


[77] Semantic Robustness Certification for Vision-Language Models cs.LG | cs.CVPDF

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur

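TL;DR: The paper proposes a semantic robustness certification framework for assessing how stable a vision-language model's (VLM) predictions are when input images change at the semantic level (e.g., shape, size, style). It uses the open-vocabulary capability of VLMs, with text prompts as semantic proxies, to construct semantic transformations of controllable extent, and it analyzes the decision boundary to certify the extent intervals over which the predicted class stays unchanged.

Details

Motivation: In real applications, VLMs often face distribution shifts induced by semantic variation, while existing certification frameworks mainly address geometric or pixel-level transformations and offer no robustness certification for semantic-level transformations.

Result: Experiments on synthetic and real-world data show that the framework certifies VLM robustness under diverse semantic variations without requiring additional data for each variation, which makes it practical to apply.

Insight: The novelty is the first certification method for VLM robustness under semantic-level transformations. It parameterizes semantic transformations with text prompts and characterizes the decision boundary in closed form, giving quantitative semantic robustness guarantees without additional data.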

Abstract: Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model’s prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.


[78] Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation cs.LG | cs.CVPDF

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

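TL;DR: The paper proposes ViGOS, a visually grounded on-policy self-distillation framework for post-training multimodal large language models (MLLMs). It decouples perception from reasoning: the student first writes a visual description and then reasons, with different teachers supervising each part, to address the text shortcut that conventional on-policy self-distillation creates in MLLMs.

Details

Motivation: When conventional on-policy self-distillation is extended directly to MLLMs, a text shortcut appears: token generation may be guided mainly by the text reference target rather than by the image.

Result: Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of on-policy self-distillation while improving image-grounded behavior in shortcut-prone settings.

Insight: The core novelty is decoupling perception (generating a visual description) from reasoning (deriving the final answer) and supervising them in stages with an image-only perception teacher and a privileged reasoning teacher, which mitigates modality shortcut bias in multimodal tasks.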

Abstract: On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.


[79] The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL cs.LG | cs.CV PDF

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han

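TL;DR: This paper proposes Discriminator-Guided RL (DRL) to address the structural mismatch between the training objective of matching-based generative models (such as score matching and flow matching) and the visual and semantic quality of their samples. DRL trains a discriminator in a pretrained representation space to separate real data from base-model samples and uses the discriminator's logit as the reward in KL-regularized reinforcement learning. This optimizes sample quality directly, without relying on human preference data, which is expensive and can conflate data realism with annotator inclinations.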

Details

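Motivation: In matching-based models (such as score matching and flow matching), the training loss (the L2 regression error on the velocity or score field) is structurally mismatched with the visual and semantic properties that determine sample quality at inference. As a result, these models often depend on preference-based reinforcement learning to recover realism and coherence that are already present in the data. The paper aims to design a reward signal that does not depend on human preferences, so as to correct this mismatch directly and improve sample quality.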

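Result: Across the SiT, JiT, REPA, and RAE backbones, DRL substantially reduces guidance-free FID (e.g., from 9.38 to 2.62 on SiT) and semantic-space FD (e.g., from 88.2 to 19.3 on DINOv3 features for SiT), with consistent gains on all backbones. DRL also improves human-preference rewards without being trained on them. Under subsequent preference-based fine-tuning, it yields a better Pareto frontier between preference reward and image fidelity, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.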

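Insight: The core innovation is to use a pretrained representation space (such as DINOv3) to constrain what the discriminator learns, so that it focuses on perceptually meaningful feature differences, and to use the discriminator's logit as the reward in KL-regularized reinforcement learning. In theory this reward approximates the log-likelihood ratio between the data and model distributions, making it an optimization signal that targets the data distribution directly. The approach obtains a high-quality reward from the data itself, with no human preference annotation, to improve the realism and semantic coherence of generated samples.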

Abstract: Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
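The claim that the discriminator logit estimates the log-likelihood ratio can be checked on a toy problem. The sketch below is not the paper's setup: it replaces images and a pretrained representation space with 1-D Gaussians, where "data" is N(1, 1) and "base-model samples" are N(0, 1), so the true log-ratio is exactly `x - 0.5`. A logistic-regression discriminator trained on balanced samples should recover roughly that line. The last step uses a standard fact about KL-regularized objectives: with reward `r` and KL coefficient 1, the optimal distribution is proportional to the base distribution times `exp(r)`, so reweighting base samples by `exp(logit)` should move their mean toward the data mean. All names here are illustrative.

```python
import numpy as np

def fit_discriminator(real, fake, steps=2000, lr=0.5):
    """Logistic regression with logit w*x + b, trained by full-batch gradient descent."""
    x = np.concatenate([real, fake])
    y = np.concatenate([np.ones_like(real), np.zeros_like(fake)])  # 1 = data, 0 = model
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # predicted P(data | x)
        grad = p - y                              # gradient of cross-entropy w.r.t. the logit
        w -= lr * np.mean(grad * x)
        b -= lr * np.mean(grad)
    return w, b

rng = np.random.default_rng(0)
real = rng.normal(1.0, 1.0, 20000)   # stand-in for data features (assumption)
fake = rng.normal(0.0, 1.0, 20000)   # stand-in for base-model sample features (assumption)

w, b = fit_discriminator(real, fake)

def reward(x):
    """Discriminator logit used as the reward; approximates log p_data(x) / p_model(x)."""
    return w * x + b

# KL-regularized tilt with coefficient 1: reweight base samples by exp(reward).
weights = np.exp(reward(fake))
tilted_mean = np.sum(weights * fake) / np.sum(weights)
print(w, b, tilted_mean)
```

With equal numbers of real and fake samples, the fitted slope and intercept should land near 1 and -0.5, and the reweighted mean of the base samples should land near the data mean of 1.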


cs.SE [Back]

[80] Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions cs.SE | cs.CL | cs.HC PDF

Hui Zhang, Shuren Song

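TL;DR: Using a real software project (Bang-v3) that ran for about one month and 391 collaborative sessions, this paper studies engineering strategies for handling conceptual drift in long-horizon LLM collaboration. It finds that over-reliance on symbolic constraints (such as identifier systems and system-prompt rules) leads LLMs to abandon genuine understanding of business semantics and fall into self-referential reasoning, producing outputs that look consistent but are disconnected from reality. The authors call this failure mode "Index Sickness". Based on the "Pang Principle (Semantic Vitality Law)", the paper designs and validates a "Baseline-Log Physical Separation" mechanism, which reduced AI instruction volume in the project by about 75% and was followed by no recurrence of Index Sickness over the subsequent ~150 sessions.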

Details

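Motivation: In long-horizon LLM collaboration, piling on formal constraints (such as symbolic identifier systems and system-prompt rules) to counter conceptual drift can fail in unexpected ways: the LLM may abandon genuine semantic understanding and produce outputs disconnected from reality. The paper sets out to address this failure.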

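Result: In the Bang-v3 project, the proposed "Baseline-Log Physical Separation" mechanism reduced AI instruction volume by about 75%, and no recurrence of "Index Sickness" was observed across the subsequent ~150 collaborative sessions.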

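Insight: The contribution is to identify and name "Index Sickness" and its canonical manifestation, "Phantom Legislation", and to propose the "Pang Principle (Semantic Vitality Law)": natural language that carries an explicit purpose conveys higher-quality information than symbolic expression. From this principle the authors derive a practical engineering mechanism, "Baseline-Log Physical Separation", which preserves semantic vitality and keeps the LLM from falling into self-referential symbolic reasoning.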

Abstract: The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs – designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate – instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern “Index Sickness,” and its canonical manifestation “Phantom Legislation.” We name the underlying principle the “Pang Principle (Semantic Vitality Law)”: natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: “Baseline-Log Physical Separation.” In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.