Table of Contents
- cs.CL [Total: 37]
- cs.CV [Total: 50]
- cs.CR [Total: 1]
- cs.DB [Total: 1]
- cs.SE [Total: 1]
- cs.AI [Total: 7]
- cs.LG [Total: 6]
- cs.RO [Total: 3]
- eess.IV [Total: 3]
cs.CL
[1] StreetMath: Study of LLMs’ Approximation Behaviors
Chiung-Yi Tseng,Somshubhra Roy,Maisha Thasin,Danyang Zhang,Blessing Effiong
Main category: cs.CL
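TL;DR: The paper introduces the StreetMath benchmark to evaluate LLMs on informal, fast approximate mathematical reasoning. It finds that LLMs tend to compute exact values or rely on external tools, and that approximate and exact computation rely on different neural components.
Details
Motivation: Existing work focuses on how LLMs perform on exact arithmetic, while their approximate reasoning ability remains largely unexplored, especially in non-autoregressive models.
Contribution: Introduces the StreetMath benchmark to fill the gap in evaluating approximate reasoning; characterizes how LLMs compute on approximation tasks and shows that this computation is neurally separate from exact computation.
Method: Designs the StreetMath benchmark; evaluates several LLM architectures (e.g., Qwen3-4B-Instruct-2507); applies mechanistic interpretability techniques to analyze internal model states.
Result: On approximation tasks, LLMs tend to compute exact values or call external tools, and approximate and exact computation rely on different neural components; the models do not show human-like cognitive miserliness.
Insight: LLMs' approximate reasoning differs markedly from their exact-computation mechanisms, so targeted optimization may be needed to mimic fast human decision-making.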
Abstract: There is a substantial body of literature examining the mathematical reasoning capabilities of large language models (LLMs), particularly their performance on precise arithmetic operations in autoregressive architectures. However, their ability to perform approximate reasoning in informal, fast-paced mathematical operations has received far less attention, especially among non-autoregressive decoder models. Our work addresses this gap by introducing StreetMath, a benchmark designed to evaluate models’ approximation abilities under real-world approximation scenarios. We conduct extensive evaluations across different LLM architectures: Qwen3-4B-Instruct-2507, Qwen3-4B-Thinking-2507, Dream-v0-Instruct-7B, Falcon-Mamba-7B-Instruct, and Mamba-GPT-3B. Furthermore, we apply mechanistic interpretability techniques to probe their internal computational states. Our analysis reveals that LLMs generally attempt to compute exact values or invoke external tools even in tasks that call for approximation. Moreover, while models sometimes reach the correct answer in early layers or steps, they still consume more tokens when solving approximation tasks. Additional experiments indicate that exact and approximate arithmetic operations rely on largely separate neural components. Drawing upon research in cognitive psychology, we argue that LLMs do not exhibit cognitive miserliness in the same way humans do in street math settings. We open source our work at https://github.com/ctseng777/StreetMath.
[2] LASTIST: LArge-Scale Target-Independent STance dataset
DongJae Kim,Yaejin Lee,Minsu Park,Eunil Park
Main category: cs.CL
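TL;DR: The paper presents the LASTIST dataset, which fills a gap in target-independent stance detection research, particularly for low-resource languages such as Korean. The dataset contains 563,299 labeled Korean sentences drawn from press releases of Korean political parties and supports both target-independent and diachronic evolution stance detection.
Details
Motivation: Current stance detection research focuses mainly on target-dependent tasks, and most benchmark datasets are in English, which limits the development of stance detection models for low-resource languages such as Korean. LASTIST is proposed to address these limitations.
Contribution: A large-scale dataset, LASTIST, that supports target-independent stance detection and diachronic evolution stance detection, filling a gap in Korean stance detection research.
Method: Collects press releases from Korean political parties and labels 563,299 Korean sentences to build the dataset; trains and evaluates state-of-the-art deep learning models on it.
Result: LASTIST provides a large-scale, high-quality benchmark for Korean stance detection that supports a range of research tasks. The dataset is publicly available.
Insight: The study shows that building a large-scale stance detection dataset for a low-resource language is feasible, and it offers a reference for similar work in other languages.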
Abstract: Stance detection has emerged as an area of research in the field of artificial intelligence. However, most research currently centers on the target-dependent stance detection task, which is based on a person’s stance in favor of or against a specific target. Furthermore, most benchmark datasets are based on English, making it difficult to develop models in low-resource languages such as Korean, especially for an emerging field such as stance detection. This study proposes the LArge-Scale Target-Independent STance (LASTIST) dataset to fill this research gap. Collected from the press releases of both Korean political parties, the LASTIST dataset contains 563,299 labeled Korean sentences. We provide a detailed description of how we collected and constructed the dataset and trained state-of-the-art deep learning and stance detection models. Our LASTIST dataset is designed for various tasks in stance detection, including target-independent stance detection and diachronic evolution stance detection. We release our dataset at https://anonymous.4open.science/r/LASTIST-3721/.
[3] zFLoRA: Zero-Latency Fused Low-Rank Adapters
Dhananjaya Gowda,Seoha Song,Harshith Goka,Junhyun Lee
Main category: cs.CL
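TL;DR: zFLoRA is a zero-latency low-rank adapter that addresses the significant inference-time compute overhead that adapter parameters add to large language models (LLMs). Experiments show that zFLoRA introduces almost no extra latency across a range of tasks and hardware platforms.
Details
Motivation: Existing adapters have few parameters (typically less than 1% of the base model), yet their inference-time compute overhead is significant (up to 2.5x that of the base model), which hurts deployment efficiency.
Contribution: Proposes a zero-latency fused low-rank adapter (zFLoRA) that sharply reduces inference overhead while performing on par with common fine-tuning methods such as LoRA and full fine-tuning (FFT).
Method: Fuses low-rank adapters into the base model to achieve zero or negligible latency overhead.
Result: On LLMs of size 1B, 3B, and 7B, zFLoRA performs well across 18 tasks, and measurements on hardware platforms show close to zero added latency.
Insight: Careful adapter design can substantially improve the efficiency of LLMs in real deployments, especially on resource-constrained devices.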
Abstract: Large language models (LLMs) are increasingly deployed with task-specific adapters catering to multiple downstream applications. In such a scenario, the additional compute associated with this apparently insignificant number of adapter parameters (typically less than 1% of the base model) turns out to be disproportionately significant at inference time (up to 2.5x that of the base model). In this paper, we propose a new zero-latency fused low-rank adapter (zFLoRA) that introduces zero or negligible latency overhead on top of the base model. Experimental results on LLMs of size 1B, 3B, and 7B show that zFLoRA compares favorably against the popular supervised fine-tuning benchmarks, including low-rank adapters (LoRA) as well as full fine-tuning (FFT). Experiments are conducted on 18 different tasks across three categories, namely commonsense reasoning, math reasoning, and summary-dialogue. Latency measurements made on NPU (Samsung Galaxy S25+) as well as GPU (NVIDIA H100) platforms show that the proposed zFLoRA adapters introduce zero to negligible latency overhead.
[4] BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection
Yaniv Nikankin,Dana Arad,Itay Itzhak,Anja Reusch,Adi Simhi,Gal Kesten-Pomeranz,Yonatan Belinkov
Main category: cs.CL
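TL;DR: The paper improves circuit faithfulness on the MIB benchmark by bringing bootstrapping, a ratio-based selection strategy, and integer linear programming into circuit discovery, and it outperforms prior approaches across multiple tasks and models.
Details
Motivation: Circuit discovery is a key challenge in mechanistic interpretability, and existing methods do not balance faithfulness and performance well.
Contribution: Proposes three improvements to circuit discovery: bootstrapping to identify edges with consistent scores, a ratio-based strategy to refine edge selection, and integer linear programming in place of greedy selection.
Method: Combines bootstrapping, ratio-based selection, and integer linear programming to improve the faithfulness and performance of discovered circuits.
Result: Across multiple MIB tasks and models, the method produces more faithful circuits and outperforms prior approaches.
Insight: Edge selection strategies that balance performance and faithfulness can substantially improve circuit discovery.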
Abstract: One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
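The bootstrapping step in entry [4] can be illustrated with a minimal sketch. It assumes per-example attribution scores are already available for each edge; the function name, the 200 resamples, and the 0.9 cutoff are hypothetical choices for illustration, not values taken from the paper.

```python
import random
from statistics import mean


def consistent_edges(scores_per_edge, n_boot=200, min_positive_frac=0.9, seed=0):
    """Keep edges whose bootstrapped mean score is positive in most resamples.

    scores_per_edge maps an edge name to its per-example attribution scores.
    n_boot and min_positive_frac are illustrative assumptions.
    """
    rng = random.Random(seed)
    kept = {}
    for edge, scores in scores_per_edge.items():
        positive = 0
        for _ in range(n_boot):
            # Resample the examples with replacement and re-average.
            sample = rng.choices(scores, k=len(scores))
            if mean(sample) > 0:
                positive += 1
        if positive / n_boot >= min_positive_frac:
            kept[edge] = mean(scores)
    return kept


toy = {
    "a->b": [0.8, 0.9, 0.7, 0.85],   # consistently positive
    "b->c": [0.9, -1.0, 0.8, -0.9],  # sign flips across examples
}
print(sorted(consistent_edges(toy)))
```

Only the edge with a stable sign survives here; the ratio-based selection and the integer linear program described in the abstract would operate on the surviving scores and are not shown.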
[5] LISTEN to Your Preferences: An LLM Framework for Multi-Objective Selection
Adam S. Jovine,Tinghan Ye,Francis Bahk,Jingjing Wang,David B. Shmoys,Peter I. Frazier
Main category: cs.CL
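TL;DR: LISTEN is a framework that uses an LLM as a zero-shot preference oracle to solve multi-objective selection problems under natural-language guidance. It proposes two iterative algorithms, LISTEN-U and LISTEN-T, for parametric and non-parametric preference settings respectively.
Details
Motivation: Human experts often struggle with multi-objective selection because complex preferences are hard to formalize; the paper aims to ease this cognitive burden with an LLM framework.
Contribution: Proposes the LISTEN framework, which uses an LLM as a preference oracle; introduces two algorithms (LISTEN-U and LISTEN-T) for different preference settings; and proposes a new concordance metric that measures how well preferences align with a parametric form.
Method: LISTEN-U uses the LLM to iteratively refine a parametric utility function; LISTEN-T runs tournament-style selection over small batches of solutions. Both are guided by the expert's natural-language priorities through the LLM.
Result: On flight booking, shopping, and exam scheduling tasks, LISTEN-U excels when preferences are parametrically aligned, while LISTEN-T is more robust in non-parametric settings.
Insight: Steering multi-objective decisions directly through an LLM is feasible and opens a new direction for natural-language-driven complex decision making.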
Abstract: Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert’s high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
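A rough sketch of the tournament-style selection that entry [5] attributes to LISTEN-T follows. The `prefer` callable stands in for the LLM preference oracle and is a hypothetical stub; the batch size and the toy flight data are also assumptions made for illustration.

```python
import random


def tournament_select(items, prefer, batch_size=4, seed=0):
    """Pick winners from small batches until a single item remains.

    prefer(batch) is a stand-in for the LLM call and returns the preferred
    item of a batch. batch_size must be at least 2 for the loop to shrink.
    """
    rng = random.Random(seed)
    pool = list(items)
    while len(pool) > 1:
        rng.shuffle(pool)
        pool = [prefer(pool[i:i + batch_size])
                for i in range(0, len(pool), batch_size)]
    return pool[0]


# Hypothetical flights as (price, duration_hours).
# The stub oracle prefers the cheapest flight and breaks ties by duration.
flights = [(320, 5.0), (210, 9.5), (450, 3.0), (210, 7.0), (600, 2.5), (280, 6.0)]
best = tournament_select(flights, prefer=min)
print(best)  # (210, 7.0)
```

Because each round only compares a handful of items, every oracle call fits in a short prompt, which is the constraint the abstract cites for working within context windows and inference costs.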
[6] Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
Haoran Deng,Yingyu Lin,Zhenghao Lin,Xiao Liu,Yizhou Sun,Yi-An Ma,Yeyun Gong
Main category: cs.CL
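TL;DR: The paper proposes LongFilter, a framework for selecting high-quality data for long-context pretraining. It quantifies long-range dependency by contrasting model predictions under long and short context, and experiments show substantial performance gains.
Details
Motivation: Long-context language models are powerful, but much of the available long-text data lacks meaningful long-range dependencies, which makes training inefficient and calls for effective data selection.
Contribution: Proposes the LongFilter framework for efficiently curating long-context pretraining data by quantifying how much long-range information a sample carries.
Method: Contrasts the model's predictions under long-context and short-context settings to identify samples that depend on long-range relations.
Result: Extending the context length of LLaMA-3-8B from 8K to 64K yields substantial improvements on HELMET, LongBench, and RULER.
Insight: Quantifying long-range dependency enables efficient data selection and improves the training of long-context models.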
Abstract: Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
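The long-versus-short contrast in entry [6] can be sketched as a per-sample score. This version assumes the score is the mean per-token difference in log-probability between the two settings; the abstract does not give the exact formula, so the scoring rule, the threshold, and the toy numbers are all hypothetical.

```python
def long_context_gain(logp_long, logp_short):
    """Average per-token gain in log-probability from the extended context."""
    if len(logp_long) != len(logp_short) or not logp_long:
        raise ValueError("need two equal-length, non-empty lists")
    return sum(l - s for l, s in zip(logp_long, logp_short)) / len(logp_long)


def select_samples(samples, threshold=0.5):
    """Keep sample ids whose gain exceeds an illustrative threshold."""
    return [sid for sid, (lp_long, lp_short) in samples.items()
            if long_context_gain(lp_long, lp_short) > threshold]


# Toy token log-probabilities: (with long context, with short context).
toy = {
    "needs_long_range": ([-0.2, -0.3, -0.1], [-2.0, -1.5, -1.8]),
    "local_only":       ([-0.9, -1.1, -1.0], [-1.0, -1.1, -1.0]),
}
print(select_samples(toy))  # ['needs_long_range']
```

A sample whose tokens are predicted about as well from local context alone scores near zero and is filtered out, which matches the abstract's observation that most spans can be predicted using only local context.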
[7] Ideology-Based LLMs for Content Moderation
Stefano Civelli,Pietro Bernardelle,Nardiena A. Pratama,Gianluca Demartini
Main category: cs.CL
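TL;DR: The paper studies the ideological bias that persona conditioning introduces when large language models (LLMs) are used for content moderation. It finds that personas affect how consistently models classify harmful content and lead to divergence across ideologies.
Details
Motivation: LLMs are widely used in content moderation, but ensuring their fairness and neutrality is a key challenge. The study explores how persona adoption affects a model's harmfulness judgments on content from different ideological positions.
Contribution: Shows that persona conditioning introduces ideological bias into LLM content moderation, visible in stronger within-ideology agreement and wider cross-ideology divergence.
Method: Analyzes experimentally how personas affect harmful content classification across LLM architectures, model sizes, and content modalities (language vs. vision), and evaluates behavior on a politically targeted task.
Result: Personas significantly affect harmfulness judgments: agreement within an ideology increases while divergence across ideologies widens, particularly for larger models.
Insight: Designers of content moderation systems should watch for the implicit bias that persona conditioning introduces, to avoid reinforcing particular ideological viewpoints.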
Abstract: Large language models (LLMs) are increasingly used in content moderation systems, where ensuring fairness and neutrality is essential. In this study, we examine how persona adoption influences the consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities (language vs. vision). At first glance, headline performance metrics suggest that personas have little impact on overall classification accuracy. However, a closer analysis reveals important behavioral shifts. Personas with different ideological leanings display distinct propensities to label content as harmful, showing that the lens through which a model “views” input can subtly shape its judgments. Further agreement analyses highlight that models, particularly larger ones, tend to align more closely with personas from the same political ideology, strengthening within-ideology consistency while widening divergence across ideological groups. To show this effect more directly, we conducted an additional study on a politically targeted task, which confirmed that personas not only behave more coherently within their own ideology but also exhibit a tendency to defend their perspective while downplaying harmfulness in opposing views. Together, these findings highlight how persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about the use of AI systems that may reinforce partisan perspectives under the guise of neutrality.
[8] Beyond Long Context: When Semantics Matter More than Tokens
Tarun Kumar Chawdhury,Jon D. Duke
Main category: cs.CL
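TL;DR: The paper evaluates Clinical Entity Augmented Retrieval (CLEAR), a method introduced by Lopez et al. 2025, on question answering over electronic health records (EHR). CLEAR improves accuracy and efficiency and uses over 70% fewer tokens than conventional approaches.
Details
Motivation: Traditional vector database methods struggle to capture nuanced clinical semantic relationships in EHR data and are inefficient on long documents.
Contribution: 1. Validates CLEAR, whose entity-aware retrieval improves semantic question answering, against zero-shot large-context inference and chunk-based retrieval-augmented generation; 2. Develops the Clinical Notes QA Evaluation Platform used for that validation.
Method: CLEAR uses entity-aware retrieval to cut token usage; the platform tests it on 12 clinical notes ranging from 10,000 to 65,000 tokens.
Result: CLEAR outperforms the alternatives on F1 score (0.90 vs. 0.86 for embedding-based retrieval), reaches an average semantic similarity of 0.878, and uses 78% fewer tokens than wide-context processing.
Insight: Entity-aware retrieval is more precise semantically and also markedly more efficient on long documents.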
Abstract: Electronic Health Records (EHR) store clinical documentation as base64 encoded attachments in FHIR DocumentReference resources, which makes semantic question answering difficult. Traditional vector database methods often miss nuanced clinical relationships. The Clinical Entity Augmented Retrieval (CLEAR) method, introduced by Lopez et al. 2025, uses entity aware retrieval and achieved improved performance with an F1 score of 0.90 versus 0.86 for embedding based retrieval, while using over 70 percent fewer tokens. We developed a Clinical Notes QA Evaluation Platform to validate CLEAR against zero shot large context inference and traditional chunk based retrieval augmented generation. The platform was tested on 12 clinical notes ranging from 10,000 to 65,000 tokens representing realistic EHR content. CLEAR achieved a 58.3 percent win rate, an average semantic similarity of 0.878, and used 78 percent fewer tokens than wide context processing. The largest performance gains occurred on long notes, with a 75 percent win rate for documents exceeding 65,000 tokens. These findings confirm that entity aware retrieval improves both efficiency and accuracy in clinical natural language processing. The evaluation framework provides a reusable and transparent benchmark for assessing clinical question answering systems where semantic precision and computational efficiency are critical.
[9] A Survey on Efficient Large Language Model Training: From Data-centric Perspectives
Junyu Luo,Bohan Wu,Xiao Luo,Zhiping Xiao,Yiqiao Jin,Rong-Cheng Tu,Nan Yin,Yifan Wang,Jingyang Yuan,Wei Ju,Ming Zhang
Main category: cs.CL
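TL;DR: This is the first systematic survey of data-efficient post-training for large language models (LLMs) from a data-centric perspective. It proposes a taxonomy, summarizes representative methods, and targets the data challenges of post-training.
Details
Motivation: LLM post-training currently faces high manual annotation costs and diminishing marginal returns as data scales, so data-efficient post-training methods are needed.
Contribution: Proposes a taxonomy of data-efficient LLM post-training that covers data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems, and summarizes the methods in each category.
Method: Conducts a systematic literature review, organizes data-efficient post-training methods into categories, and analyzes the strengths and weaknesses of representative techniques.
Result: Summarizes the state of research on data-efficient post-training and identifies future research directions.
Insight: Efficient use of data is key to better LLM post-training; future work should further explore data quality improvement and adaptive data generation.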
Abstract: Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
[10] Evaluating the Impact of LLM-Assisted Annotation in a Perspectivized Setting: the Case of FrameNet Annotation
Frederico Belcavello,Ely Matos,Arthur Lorenzi,Lisandra Bonoto,Lívia Ruiz,Luiz Fernando Pereira,Victor Herbst,Yulla Navarro,Helen de Andrade Abreu,Lívia Dutra,Tiago Timponi Torrent
Main category: cs.CL
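TL;DR: The paper evaluates the impact of LLM-assisted annotation on FrameNet semantic annotation by comparing manual, automatic, and semi-automatic settings. The semi-automatic setting yields higher frame diversity and similar coverage compared with the human-only setting.
Details
Motivation: LLM-assisted annotation is increasingly common in the creation of language resources and datasets, but its performance and impact have not been evaluated thoroughly, especially under a perspectivized approach to NLP.
Contribution: Helps close the gap in evaluating LLM-assisted FrameNet semantic annotation by reporting a detailed experimental comparison.
Method: Compares annotation time, coverage, and diversity across three settings: manual, automatic, and semi-automatic annotation.
Result: The semi-automatic setting increases frame diversity and reaches similar coverage compared with the human-only setting; the automatic setting performs considerably worse on every metric except annotation time.
Insight: LLM assistance is most useful for semantic annotation in the semi-automatic mode, where it is combined with intervention by human experts.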
Abstract: The use of LLM-based applications as a means to accelerate and/or substitute human labor in the creation of language resources and datasets is a reality. Nonetheless, despite the potential of such tools for linguistic research, comprehensive evaluation of their performance and impact on the creation of annotated datasets, especially under a perspectivized approach to NLP, is still missing. This paper contributes to reducing this gap by reporting on an extensive evaluation of the (semi-)automation of FrameNet-like semantic annotation using an LLM-based semantic role labeler. The methodology employed compares annotation time, coverage, and diversity in three experimental settings: manual, automatic, and semi-automatic annotation. Results show that the hybrid, semi-automatic annotation setting leads to increased frame diversity and similar annotation coverage when compared to the human-only setting, while the automatic setting performs considerably worse in all metrics except annotation time.
[11] RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
André V. Duarte,Xuying Li,Bin Zeng,Arlindo L. Oliveira,Lei Li,Zhuo Li
Main category: cs.CL
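TL;DR: RECAP is an agentic pipeline for eliciting and verifying memorized training data from large language models (LLMs). A feedback-driven loop and a jailbreaking module substantially improve extraction.
Details
Motivation: The work asks how to verify whether a model has memorized particular content when its training data cannot be inspected directly.
Contribution: Proposes the RECAP pipeline, which combines a feedback-driven extraction loop with a jailbreaking module and substantially improves the extraction of memorized content.
Method: RECAP 1) extracts and verifies content through a feedback loop; 2) uses a jailbreaking module to overcome alignment-induced refusals; 3) is evaluated on the EchoTrace benchmark.
Result: With GPT-4.1, RECAP raises the average ROUGE-L score from 0.38 to 0.47, an increase of nearly 24%.
Insight: Memorized content in LLMs can be extracted effectively with an agentic pipeline, and a model's alignment can otherwise block the generation of some of that content.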
Abstract: If we cannot inspect the training data of a large language model (LLM), how can we ever know what it has seen? We believe the most compelling evidence arises when the model itself freely reproduces the target content. As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs. At the heart of RECAP is a feedback-driven loop, where an initial extraction attempt is evaluated by a secondary language model, which compares the output against a reference passage and identifies discrepancies. These are then translated into minimal correction hints, which are fed back into the target model to guide subsequent generations. In addition, to address alignment-induced refusals, RECAP includes a jailbreaking module that detects and overcomes such barriers. We evaluate RECAP on EchoTrace, a new benchmark spanning over 30 full books, and the results show that RECAP leads to substantial gains over single-iteration approaches. For instance, with GPT-4.1, the average ROUGE-L score for the copyrighted text extraction improved from 0.38 to 0.47 - a nearly 24% increase.
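The ROUGE-L figures quoted in entry [11] can be made concrete with a small sketch. It computes a simplified ROUGE-L as an F1 score over the longest common subsequence of whitespace tokens, which is an assumption (reference implementations add tokenization and weighting options), and it checks the "nearly 24%" relative gain from the two scores given in the abstract.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 over whitespace tokens (a simplified ROUGE-L)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on a mat", "the cat sat on the mat")
relative_gain = (0.47 - 0.38) / 0.38  # scores quoted in the abstract
print(round(score, 3), round(relative_gain, 3))  # 0.833 0.237
```

The relative gain of about 23.7% is what the abstract rounds to "nearly 24%"; the absolute gain is 0.09 ROUGE-L points.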
[12] SymCode: A Neurosymbolic Approach to Mathematical Reasoning via Verifiable Code Generation
Sina Bagheri Nezhad,Yao Li,Ameeta Agrawal
Main category: cs.CL
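TL;DR: SymCode is a neurosymbolic framework that solves mathematical reasoning problems by generating verifiable code, substantially improving the accuracy and verifiability of LLMs on complex math tasks.
Details
Motivation: When large language models (LLMs) handle complex mathematical reasoning, they often produce unverified prose solutions with arithmetic errors. Current prompting methods such as Chain of Thought lack a mechanism for deterministic verification. SymCode aims to solve this through verifiable code generation.
Contribution: Reframes mathematical problem solving as the task of generating verifiable code with the SymPy library, substantially improving the accuracy and verifiability of LLMs.
Method: SymCode combines neural and symbolic methods and uses the SymPy library to generate verifiable code. At its core, it translates a math problem into code and relies on a symbolic engine to check the solution.
Result: On benchmarks such as MATH-500 and OlympiadBench, SymCode improves accuracy by up to 13.6 percentage points over baselines while using tokens more efficiently.
Insight: Combining LLM reasoning with a symbolic computation engine improves accuracy and also shifts model failures from opaque logical errors to transparent programmatic errors, a key step toward trustworthy AI in formal domains.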
Abstract: Large Language Models (LLMs) often struggle with complex mathematical reasoning, where prose-based generation leads to unverified and arithmetically unsound solutions. Current prompting strategies like Chain of Thought still operate within this unreliable medium, lacking a mechanism for deterministic verification. To address these limitations, we introduce SymCode, a neurosymbolic framework that reframes mathematical problem-solving as a task of verifiable code generation using the SymPy library. We evaluate SymCode on challenging benchmarks, including MATH-500 and OlympiadBench, demonstrating significant accuracy improvements of up to 13.6 percentage points over baselines. Our analysis shows that SymCode is not only more token-efficient but also fundamentally shifts model failures from opaque logical fallacies towards transparent, programmatic errors. By grounding LLM reasoning in a deterministic symbolic engine, SymCode represents a key step towards more accurate and trustworthy AI in formal domains.
[13] AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache
Dinghong Song,Yuan Feng,Yiwei Wang,Shangye Chen,Cyril Guyot,Filip Blagojevic,Hyeran Jeon,Pengfei Su,Dong Li
Main category: cs.CL
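TL;DR: AttnCache accelerates the prefill stage of LLM inference by reusing similar attention maps, which substantially reduces the cost of self-attention.
Details
Motivation: Many real-world tasks rely only on the prefill stage of an LLM, where the quadratic complexity of self-attention becomes the main performance bottleneck. Semantically different sentences often produce similar attention maps, which motivates reusing them.
Contribution: Proposes the AttnCache framework, which speeds up self-attention in the prefill stage through attention-map caching and similarity search.
Method: Builds an attention-map memorization database and uses caching and similarity search to reuse pre-stored attention maps, reducing compute overhead.
Result: AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy loss.
Insight: Attention maps are similar even across semantically different inputs, and exploiting this can substantially improve LLM inference efficiency.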
Abstract: Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many real-world workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill-only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
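The retrieve-and-reuse idea in entry [13] can be sketched as a toy lookup table. This is only an illustration under stated assumptions: the class name, the cosine-similarity key, the 0.95 threshold, and the linear scan are all hypothetical, and the abstract does not say how AttnCache represents inputs or indexes its database.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class AttentionMapCache:
    """Toy cache keyed by an input embedding; values are stored attention maps."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # illustrative similarity cutoff
        self.entries = []           # list of (embedding, attention_map)

    def add(self, embedding, attention_map):
        self.entries.append((embedding, attention_map))

    def lookup(self, embedding):
        """Return the most similar cached map, or None if nothing is close enough."""
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None


cache = AttentionMapCache()
cache.add([1.0, 0.0, 0.2], [[1.0, 0.0], [0.4, 0.6]])
hit = cache.lookup([0.9, 0.05, 0.2])   # close to the cached embedding
miss = cache.lookup([0.0, 1.0, 0.0])   # dissimilar, so attention is recomputed
print(hit is not None, miss is None)   # True True
```

A linear scan is fine for a sketch; a practical system would need an approximate nearest-neighbor index so that the lookup costs less than the attention computation it replaces.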
[14] Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng,I-Hung Hsu,Jun Yan,Zifeng Wang,Rujun Han,Gufeng Zhang,Yanfei Chen,Wei Wang,Tomas Pfister,Chen-Yu Lee
Main category: cs.CL
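TL;DR: The paper proposes Supervised Reinforcement Learning (SRL), a new framework that addresses the weakness of large language models on multi-step reasoning tasks. By combining expert trajectories with step-wise reasoning, SRL trains small open-source models more effectively.
Details
Motivation: Large language models perform poorly on multi-step reasoning tasks, and both supervised fine-tuning and reinforcement learning have limitations, such as overfitting or rarely sampling a correct solution. SRL aims to fill this gap.
Contribution: Reformulates problem solving as generating a sequence of logical actions and uses step-wise supervision signals to improve learning, providing rich feedback even when every rollout fails.
Method: SRL trains the model to produce an internal reasoning monologue before each action and computes rewards from the step-wise similarity to expert actions. SRL also combines the strengths of SFT and RLVR.
Result: SRL enables small models to learn hard problems that SFT and RLVR could not, and performance improves further when SRL is followed by RLVR. SRL also performs well on software engineering tasks.
Insight: Combining step-wise reasoning with expert supervision gives the model more flexible reasoning and a more stable learning signal. The approach is general and robust across reasoning tasks.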
Abstract: Large Language Models (LLMs) often struggle with problems that require multi-step reasoning. For small-scale open-source models, Reinforcement Learning with Verifiable Rewards (RLVR) fails when correct solutions are rarely sampled even after many attempts, while Supervised Fine-Tuning (SFT) tends to overfit long demonstrations through rigid token-by-token imitation. To address this gap, we propose Supervised Reinforcement Learning (SRL), a framework that reformulates problem solving as generating a sequence of logical “actions”. SRL trains the model to generate an internal reasoning monologue before committing to each action. It provides smoother rewards based on the similarity between the model’s actions and expert actions extracted from the SFT dataset in a step-wise manner. This supervision offers richer learning signals even when all rollouts are incorrect, while encouraging flexible reasoning guided by expert demonstrations. As a result, SRL enables small models to learn challenging problems previously unlearnable by SFT or RLVR. Moreover, initializing training with SRL before refining with RLVR yields the strongest overall performance. Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
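The step-wise similarity reward in entry [14] can be sketched as follows. The abstract states only that rewards are based on similarity between the model's actions and expert actions; using `difflib.SequenceMatcher` as the similarity measure, and the toy action strings, are assumptions made for this illustration.

```python
from difflib import SequenceMatcher


def step_reward(model_action, expert_action):
    """Similarity in [0, 1] between a generated action and the expert action."""
    return SequenceMatcher(None, model_action, expert_action).ratio()


def trajectory_rewards(model_actions, expert_actions):
    """One reward per step; reasoning monologues are assumed to be stripped first."""
    return [step_reward(m, e) for m, e in zip(model_actions, expert_actions)]


expert = ["expand (x + 1)^2", "subtract 1 from both sides"]
rollout = ["expand (x + 1)^2", "add 1 to both sides"]
rewards = trajectory_rewards(rollout, expert)
print([round(r, 2) for r in rewards])
```

The second step earns partial credit even though the rollout as a whole is wrong. That graded signal is what the abstract contrasts with RLVR, where a rollout that misses the final answer contributes no reward.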
[15] PORTool: Tool-Use LLM Training with Rewarded Tree
Feijie Wu,Weiwu Zhu,Yuxiang Zhang,Soumya Chatterjee,Jiarong Zhu,Fan Mo,Rodin Luo,Jing Gao
Main category: cs.CL
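TL;DR: PORTool is a reinforcement learning method for training tool-use LLMs. It generates a tree of diverse trajectories and assigns step-level rewards to improve performance in dynamic tool-call environments.
Details
Motivation: Existing tool-use LLMs are trained on static datasets, which limits their performance in dynamic tool-call environments and prevents them from exploring alternative solutions. PORTool aims to address this.
Contribution: Proposes PORTool, which uses reinforcement learning to encourage exploration of diverse tool-call trajectories and optimizes training with step-level rewards and relative advantages.
Method: 1. Generates a tree of diverse trajectories for each query; 2. Assigns a reward to each step; 3. Trains the LLM with a blend of fork-relative and trajectory-relative advantages.
Result: In an experimental environment with 17 tools, PORTool significantly improves final accuracy and the number of tool-call steps.
Insight: The design of step-level rewards is critical to performance, and exploring diverse trajectories effectively strengthens tool use in dynamic environments.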
Abstract: Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
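The blend of fork-relative and trajectory-relative advantages in entry [15] can be sketched under simple assumptions: each advantage is a reward minus the mean of its peer group (siblings under the same fork, or all rollouts for the query), and the two are mixed with a fixed weight. The mean baseline, the 0.5 weight, and all names are hypothetical; the abstract does not give the exact formulas.

```python
from statistics import mean


def fork_relative(step_rewards_by_fork):
    """Advantage of each step relative to its siblings under the same fork."""
    adv = {}
    for siblings in step_rewards_by_fork.values():
        baseline = mean(siblings.values())
        for step, reward in siblings.items():
            adv[step] = reward - baseline
    return adv


def trajectory_relative(trajectory_rewards):
    """Advantage of each rollout relative to all rollouts for the same query."""
    baseline = mean(trajectory_rewards.values())
    return {t: r - baseline for t, r in trajectory_rewards.items()}


def blended(step, trajectory, fork_adv, traj_adv, weight=0.5):
    """Convex blend of the two advantages; the weight is an assumption."""
    return weight * fork_adv[step] + (1 - weight) * traj_adv[trajectory]


# Two rollouts that diverge at the first tool call.
forks = {"root": {"search": 1.0, "calculator": 0.0}}
trajs = {"t1": 1.0, "t2": 0.0}
f_adv, t_adv = fork_relative(forks), trajectory_relative(trajs)
print(blended("search", "t1", f_adv, t_adv))  # 0.5
```

The fork-relative term credits the specific step where rollouts diverge, while the trajectory-relative term keeps the outcome of the whole rollout in view.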
[16] QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
Taku Mikuriya,Tatsuya Ishigaki,Masayuki Kawarada,Shunya Minami,Tadashi Kadowaki,Yohichi Suzuki,Soshun Naito,Shunya Takata,Takumi Kato,Tamotsu Basseda,Reo Yamada,Hiroya Takamura
Main category: cs.CL
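TL;DR: QCoder Benchmark is a framework for evaluating large language models (LLMs) on quantum programming tasks. It supports quantitative and qualitative analysis through simulated hardware feedback and comparison with human-written code. Reasoning-based models are markedly more accurate than both GPT-4o and the human average.
Details
Motivation: LLMs perform well on automatic code generation, but domains that require interaction with hardware devices, such as quantum programming, remain underexplored. QCoder aims to fill this gap with an evaluation method that incorporates simulated hardware feedback.
Contribution: 1. Proposes QCoder Benchmark, which supports evaluation with a quantum simulator and feeds back metrics such as circuit depth; 2. Incorporates a dataset of human-written code for comparison with LLM outputs; 3. Releases the dataset and a public evaluation API.
Method: Uses a quantum simulator environment to provide domain-specific feedback metrics (circuit depth, execution time, and so on) and compares LLM outputs against human-written code.
Result: GPT-4o reaches only 18.97% accuracy, while reasoning-based models such as o3 reach up to 78%, well above the average success rate of human-written code (39.98%).
Insight: Quantum programming is challenging for LLMs, and models with stronger reasoning perform better; simulated hardware feedback can guide better generation.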
Abstract: Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it remains underexplored in domains that require interaction with hardware devices, such as quantum programming, where human coders write Python code that is executed on a quantum computer. To address this gap, we introduce QCoder Benchmark, an evaluation framework that assesses LLMs on quantum programming with feedback from simulated hardware devices. Our benchmark offers two key features. First, it supports evaluation using a quantum simulator environment beyond conventional Python execution, allowing feedback of domain-specific metrics such as circuit depth, execution time, and error classification, which can be used to guide better generation. Second, it incorporates human-written code submissions collected from real programming contests, enabling both quantitative comparisons and qualitative analyses of LLM outputs against human-written code. Our experiments reveal that even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming the average success rate of human-written code (39.98%). We release the QCoder Benchmark dataset and public evaluation API to support further research.
[17] Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking
Feng Ju,Zeyu Qin,Rui Min,Zhitao He,Lingpeng Kong,Yi R. Fung
Main category: cs.CL
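TL;DR: The paper proposes a new training paradigm, 1PNS (one problem, multiple solutions), and a metric, RPD (Reasoning Path Divergence), to increase the output diversity of large language models (LLMs) on reasoning tasks. Fine-tuning on diverse solution sets selected with RPD significantly improves pass@k.
Details
Motivation: LLM training conventionally follows the 1P1S (one problem, one solution) pattern, which narrows the reasoning paths a model produces and limits diversity and performance. The paper addresses this with diversified training and a quantitative metric.
Contribution: 1. Proposes the 1PNS training paradigm, which exposes the model to different reasoning paths; 2. Designs the RPD metric, which quantifies semantic differences between multi-step reasoning chains; 3. Shows experimentally that RPD-selected training data significantly improves performance and output diversity.
Method: Uses RPD to align and score Long Chain-of-Thought solutions, selects a diverse solution set for each problem, and fine-tunes Qwen3-4B-Base on it.
Result: RPD-selected training data yields an average gain of 2.80% in pass@16 and a gain of 4.99% on AIME24, confirming the effectiveness of 1PNS.
Insight: Diversified training improves performance and exposes the limits of single-answer training. RPD provides a new tool for quantifying reasoning diversity.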
Abstract: While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common “one problem, one solution” (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a “one problem, multiple solutions” (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence .
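Entry [17] reports its gains in pass@k without defining the metric. A widely used estimator computes, from n sampled generations of which c are correct, the probability that at least one of k samples drawn without replacement is correct. Whether the paper uses exactly this estimator is an assumption; the sketch below only shows the standard calculation.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generations of which c are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 1, 4))  # 1.0
```

Under this metric, more varied outputs help at large k because one correct reasoning path among the k samples is enough, which is why output diversity and pass@16 move together in the abstract.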
[18] On the Influence of Discourse Relations in Persuasive Texts
Nawar Turk,Sevag Kaspar,Leila Kosseim
Main category: cs.CL
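TL;DR: The paper studies the link between Persuasion Techniques (PTs) and Discourse Relations (DRs). Using large language models (LLMs) and prompt engineering, it builds five silver-annotated datasets and finds that six discourse relations play a key role in persuasive texts.
Details
Motivation: No dataset is annotated with both PTs and DRs, and understanding the role of DRs in persuasive texts matters for detecting online propaganda and misinformation.
Contribution: Proposes an LLM-based method for annotating DRs, builds five silver datasets, and shows that six DRs play a central role in persuasive texts.
Method: Builds DR classifiers from four LLMs and ten prompts, generates the silver datasets through majority-pooling strategies, and analyzes them statistically.
Result: Six DRs (Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) prove especially important for persuasion techniques such as Loaded Language.
Insight: Distinguishing DRs helps expose the devices at work in persuasive texts and offers a new angle on misinformation detection.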
Abstract: This paper investigates the relationship between Persuasion Techniques (PTs) and Discourse Relations (DRs) by leveraging Large Language Models (LLMs) and prompt engineering. Since no dataset annotated with both PTs and DRs exists, we took the SemEval 2023 Task 3 dataset labelled with 19 PTs as a starting point and developed LLM-based classifiers to label each instance of the dataset with one of the 22 PDTB 3.0 level-2 DRs. In total, four LLMs were evaluated using 10 different prompts, resulting in 40 unique DR classifiers. Ensemble models using different majority-pooling strategies were used to create 5 silver datasets of instances labelled with both persuasion techniques and level-2 PDTB senses. The silver dataset sizes vary from 1,281 instances to 204 instances, depending on the majority pooling technique used. Statistical analysis of these silver datasets shows that six discourse relations (namely Cause, Purpose, Contrast, Cause+Belief, Concession, and Condition) play a crucial role in persuasive texts, especially in the use of Loaded Language, Exaggeration/Minimisation, Repetition and to cast Doubt. This insight can contribute to detecting online propaganda and misinformation, as well as to our general understanding of effective communication.
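The majority-pooling step can be pictured with a short sketch. It assumes each instance receives one predicted DR label per classifier and is kept only when the most common label exceeds a chosen share of the votes. The function name `majority_pool` and the `min_share` thresholds are hypothetical. The abstract does not specify the pooling rules actually used.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_pool(labels: Sequence[str], min_share: float = 0.5) -> Optional[str]:
    """Return the most common label if its vote share exceeds min_share.

    Returns None when no label is frequent enough, meaning the instance
    would be left out of the silver dataset (illustrative rule only).
    """
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) > min_share else None
```

A stricter threshold keeps fewer instances, which is consistent with the reported silver dataset sizes varying from 1,281 down to 204 depending on the pooling technique.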
[19] RCScore: Quantifying Response Consistency in Large Language Models
Dongjun Jang,Youngchae Ahn,Hyopil Shin
Main category: cs.CL
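TL;DR: RCScore is a framework for quantifying response consistency in LLMs. It measures along multiple dimensions how instruction style affects model outputs, reveals performance variation that conventional metrics miss, and introduces Cross-Response Similarity (CRS) as a proxy for model reliability.
Details
Motivation: LLM evaluations usually rely on a single instruction template and overlook how sensitive models are to instruction style. RCScore aims to fill this gap by quantifying the effect of instruction style, which matters for reliability in real-world deployment.
Contribution: 1) Proposes the RCScore framework for quantifying how instruction style affects model outputs; 2) Introduces Cross-Response Similarity (CRS) to measure a model's self-consistency; 3) Shows that instruction style can shift accuracy by up to 16.7 percentage points.
Method: Systematically transforms benchmark problems into multiple instruction styles and computes multi-dimensional metrics. CRS applies the RCScore metrics to measure stylistic self-consistency.
Result: Across ten LLMs and four reasoning benchmarks, instruction style shifts accuracy by up to 16.7 percentage points. CRS correlates strongly with task accuracy, deterministic decoding yields more stylistically stable outputs, and model scale correlates positively with cross-style consistency.
Insight: Instruction robustness is an important dimension of LLM evaluation. Consistency can serve as a proxy for reliability, and larger models tend to be more consistent.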
Abstract: Current LLM evaluations often rely on a single instruction template, overlooking models’ sensitivity to instruction style, a critical aspect for real-world deployments. We present RCScore, a multi-dimensional framework quantifying how instruction formulation affects model responses. By systematically transforming benchmark problems into multiple instruction styles, RCScore reveals performance variations undetected by conventional metrics. Our experiments across ten LLMs on four reasoning benchmarks demonstrate that instruction style can shift accuracy by up to 16.7 percentage points. We introduce Cross-Response Similarity (CRS), a method applying RCScore metrics to measure stylistic self-consistency, and establish its strong correlation with task accuracy, suggesting consistency as a valuable proxy for model reliability. Additional findings show that deterministic decoding produces more stylistically stable outputs, and model scale correlates positively with cross-style consistency. RCScore offers a principled approach to assess instruction robustness.
[20] Don’t Let It Fade: Preserving Edits in Diffusion Language Models via Token Timestep Allocation
Woojin Kim,Jaeyoung Do
Main category: cs.CL
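TL;DR: The paper proposes Token Timestep Allocation (TTA), which assigns each token its own timestep schedule to address update forgetting in diffusion language models (DLMs), improving the fluency and controllability of generation.
Details
Motivation: Fine-grained control of diffusion language models is fragile. The failure mode is update forgetting: uniform, context-agnostic updates erase earlier semantic edits.
Contribution: Proposes TTA, which controls DLMs through soft, semantic token ordering and operates purely at inference time, with no training required.
Method: TTA assigns per-token timestep schedules: critical tokens are frozen early while uncertain tokens continue to be refined, under either a fixed or an adaptive policy.
Result: On sentiment control, TTA yields more than 20 percent higher accuracy and nearly halves perplexity while using less than one fifth of the steps. On detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0).
Insight: Timestep allocation is the key lever for stable, controllable diffusion text generation, and soft semantic ordering mitigates update forgetting.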
Abstract: While diffusion language models (DLMs) enable fine-grained refinement, their practical controllability remains fragile. We identify and formally characterize a central failure mode called update forgetting, in which uniform and context agnostic updates induce token level fluctuations across timesteps, erasing earlier semantic edits and disrupting the cumulative refinement process, thereby degrading fluency and coherence. As this failure originates in uniform and context agnostic updates, effective control demands explicit token ordering. We propose Token Timestep Allocation (TTA), which realizes soft and semantic token ordering via per token timestep schedules: critical tokens are frozen early, while uncertain tokens receive continued refinement. This timestep based ordering can be instantiated as either a fixed policy or an adaptive policy driven by task signals, thereby supporting a broad spectrum of refinement strategies. Because it operates purely at inference time, it applies uniformly across various DLMs and naturally extends to diverse supervision sources. Empirically, TTA improves controllability and fluency: on sentiment control, it yields more than 20 percent higher accuracy and nearly halves perplexity using less than one fifth the steps; in detoxification, it lowers maximum toxicity (12.2 versus 14.5) and perplexity (26.0 versus 32.0). Together, these results demonstrate that softened ordering via timestep allocation is the critical lever for mitigating update forgetting and achieving stable and controllable diffusion text generation.
[21] Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning
Qi Luo,Xiaonan Li,Tingshuo Fan,Xinchi Chen,Xipeng Qiu
Main category: cs.CL
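TL;DR: The paper introduces GlobalQA, the first benchmark designed specifically to evaluate global retrieval-augmented generation (global RAG), and GlobalRAG, a multi-tool collaborative framework that substantially improves performance on global tasks.
Details
Motivation: Existing RAG benchmarks focus mainly on local retrieval tasks, but real-world applications need global analysis, such as aggregating and reasoning over information across an entire document collection. The work aims to fill this gap.
Contribution: 1. Introduces GlobalQA, the first global RAG benchmark; 2. Designs GlobalRAG, a multi-tool collaborative framework whose intelligent filters and aggregation modules substantially improve performance.
Method: GlobalRAG combines chunk-level retrieval, LLM-driven intelligent filters, and aggregation modules for symbolic computation to handle global tasks.
Result: On Qwen2.5-14B, GlobalRAG reaches 6.63 F1, compared with 1.51 F1 for the strongest baseline.
Insight: Global RAG tasks call for structured processing and noise filtering, and multi-tool collaboration is an effective way to provide both.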
Abstract: Retrieval-augmented generation (RAG) has emerged as a leading approach to reducing hallucinations in large language models (LLMs). Current RAG evaluation benchmarks primarily focus on what we call local RAG: retrieving relevant chunks from a small subset of documents to answer queries that require only localized understanding within specific text chunks. However, many real-world applications require a fundamentally different capability – global RAG – which involves aggregating and analyzing information across entire document collections to derive corpus-level insights (for example, “What are the top 10 most cited papers in 2023?”). In this paper, we introduce GlobalQA – the first benchmark specifically designed to evaluate global RAG capabilities, covering four core task types: counting, extremum queries, sorting, and top-k extraction. Through systematic evaluation across different models and baselines, we find that existing RAG methods perform poorly on global tasks, with the strongest baseline achieving only 1.51 F1 score. To address these challenges, we propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval, incorporates LLM-driven intelligent filters to eliminate noisy documents, and integrates aggregation modules for precise symbolic computation. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline’s 1.51 F1, validating the effectiveness of our method.
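The filter-then-aggregate idea behind corpus-level queries can be sketched in a few lines. This is an illustrative sketch only, not the GlobalRAG implementation. It assumes each document is a dict with a hypothetical `cites` field listing the titles it cites, and it uses a plain predicate (`is_relevant`) as a stand-in for an LLM-driven filter.

```python
from collections import Counter
from typing import Callable, Iterable


def top_k_cited(
    documents: Iterable[dict],
    is_relevant: Callable[[dict], bool],
    k: int,
) -> list[tuple[str, int]]:
    """Filter the whole corpus, then aggregate symbolically rather than via the LLM."""
    counts: Counter = Counter()
    for doc in documents:
        if not is_relevant(doc):  # stand-in for an LLM-driven noise filter
            continue
        for cited_title in doc["cites"]:  # hypothetical field name
            counts[cited_title] += 1
    return counts.most_common(k)
```

The point of the design is that counting, sorting, and top-k extraction are done by exact code over every retained document, so the answer does not depend on the language model keeping a running tally across a long context.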
[22] Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
Takuma Sato,Seiya Kawano,Koichiro Yoshino
Main category: cs.CL
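TL;DR: The study shows that presenting pragmatic theories (such as Gricean pragmatics and Relevance Theory) to LLMs as prompts improves their understanding of implied meanings, by up to 9.6%.
Details
Motivation: Accurately interpreting implied meanings is central to human communication, and language models are expected to have the same ability. The study explores how to improve models' pragmatic reasoning in zero-shot or few-shot settings.
Contribution: Proposes using pragmatic theories as prompts, which raises scores on implied-meaning tasks by up to 9.6%, and finds that merely naming the theories also yields a small gain.
Method: Provides an overview of pragmatic theories as a prompt to guide the model through step-by-step reasoning, and compares prompting strategies (explicit explanation of the theory versus mentioning only its name).
Result: Prompts that explicitly present pragmatic theories outperform the baseline (0-shot Chain-of-Thought) on pragmatic reasoning tasks. In larger models, mentioning only the theory names gives a gain of around 1-3%.
Insight: Information about pragmatic theories works as an effective in-context learning aid. Presenting or even naming a theoretical framework appears to activate relevant knowledge in the model.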
Abstract: The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks to understand implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.
[23] Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages
Mérilin Sousa Silva,Sina Ahmadi
Main category: cs.CL
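TL;DR: The paper studies whether pretrained language models can identify loanwords in a multilingual setting and finds that they perform poorly, consistent with a bias toward loanwords.
Details
Motivation: To examine whether pretrained language models can distinguish loanwords from native vocabulary as human speakers do, particularly in bilingual communities where a dominant language exerts lexical influence on a minority language.
Contribution: Evaluates multiple models on loanword identification across 10 languages and documents their limitations on the task.
Method: Tests multiple pretrained language models on loanword identification, giving them explicit instructions and contextual information.
Result: Models perform poorly at distinguishing loanwords from native words, corroborating previous evidence of a bias toward loanwords over native equivalents.
Insight: The findings matter for building NLP tools for minority languages and for supporting language preservation, especially in communities under heavy lexical pressure.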
Abstract: Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient’s lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.
[24] Distilling Multilingual Vision-Language Models: When Smaller Models Stay Multilingual
Sukrit Sriratanawilai,Jhayahgrit Thongwat,Romrawin Chumpu,Patomporn Payoungkhamdee,Sarana Nutanong,Peerat Limkonchotiwat
Main category: cs.CL
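TL;DR: The paper studies how multilingual vision-language models (VLMs) behave under different distillation approaches. Some configurations preserve or even improve multilingual retrieval robustness while halving model size, whereas others fail to maintain cross-task stability.
Details
Motivation: Multilingual VLMs perform unevenly across languages, a problem that worsens as model size shrinks. Knowledge distillation (KD) for VLMs has not been studied thoroughly in multilingual settings.
Contribution: Presents an empirical study of five distillation approaches, analyzing their effects on cross-lingual representation consistency and on downstream performance stability under model compression, and exposes how sensitive the outcome is to distillation design.
Method: Studies five distillation formulations on CLIP and SigLIP2 and evaluates them on in-domain retrieval and out-of-domain visual question answering.
Result: Some distillation configurations preserve or improve multilingual retrieval robustness at half the model size, but others fail to maintain cross-task stability.
Insight: The choice of distillation method strongly affects the performance and stability of multilingual VLMs, and aggregate accuracy alone does not reveal these effects.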
Abstract: Vision-language models (VLMs) exhibit uneven performance across languages, a problem that is often exacerbated when the model size is reduced. While Knowledge distillation (KD) demonstrates promising results in transferring knowledge from larger to smaller VLMs, applying KD in multilingualism is an underexplored area. This paper presents a controlled empirical study of KD behavior across five distillation approaches, isolating their effects on cross-lingual representation consistency and downstream performance stability under model compression. We study five distillation formulations across CLIP and SigLIP2, and evaluate them on in-domain retrieval and out-of-domain visual QA. We find that some configurations preserve or even improve multilingual retrieval robustness despite halving model size, but others fail to maintain cross-task stability, exposing design-sensitive trade-offs that aggregate accuracy alone does not reveal.
[25] Do LLMs Signal When They’re Right? Evidence from Neuron Agreement
Kang Chen,Yaoning Wang,Kai Xiong,Zhuoka Feng,Wenhe Sun,Haotian Chen,Yixin Cao
Main category: cs.CL
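TL;DR: The paper analyzes internal neuron activation patterns in large language models (LLMs) and finds that correct responses are sparser and agree more with one another. Building on this, the authors propose Neuron Agreement Decoding (NAD), an unsupervised decoding method that uses internal signals to select the best candidate response, improving efficiency while maintaining generation quality.
Details
Motivation: Existing decoding methods rely mainly on external output signals (such as token probabilities or entropies), which can be poorly calibrated after post-training. The authors look to internal neuron activation patterns for a more reliable signal to improve model reasoning.
Contribution: Proposes Neuron Agreement Decoding (NAD), which selects candidate responses using activation sparsity and cross-sample neuron agreement, enabling efficient decoding without labeled data.
Method: NAD analyzes the sparsity of neuron activations and their agreement across samples to choose the best candidate, and supports early correctness prediction and early stopping.
Result: NAD matches majority voting on math and science benchmarks, outperforms Avg@64 on open-ended coding benchmarks, and reduces token usage by 99% with minimal loss in generation quality.
Insight: A model's internal neuron activation patterns are a reliable signal source for efficient unsupervised decoding and inference optimization.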
Abstract: Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self evaluations, and these signals can be poorly calibrated after post training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label free ensemble decoding.
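To make the two signals concrete, here is a minimal sketch of a best-of-N selector driven by neuron agreement and sparsity. It is an illustration under stated assumptions, not the paper's scoring rule: each candidate is represented by the set of neuron indices it activated, agreement is measured as mean Jaccard similarity to the other candidates, and ties go to the candidate with fewer active neurons. The function names are hypothetical.

```python
from typing import Sequence


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two neuron sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_by_agreement(active_neurons: Sequence[frozenset]) -> int:
    """Return the index of the candidate whose activated-neuron set agrees
    most with the other candidates; ties go to the sparser candidate."""
    if not active_neurons:
        raise ValueError("need at least one candidate")

    def score(i: int) -> tuple:
        others = [s for j, s in enumerate(active_neurons) if j != i]
        agreement = (
            sum(jaccard(active_neurons[i], s) for s in others) / len(others)
            if others
            else 0.0
        )
        # Higher agreement first, then fewer active neurons (sparsity).
        return (agreement, -len(active_neurons[i]))

    return max(range(len(active_neurons)), key=score)
```

Because the selector never compares output text, it can in principle rank candidates whose answers are not directly comparable, which is the situation the abstract describes for open-ended coding tasks.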
[26] Unravelling the Mechanisms of Manipulating Numbers in Language Models
Michal Štefánik,Timothee Mickus,Marek Kadlčík,Bertram Højer,Michal Spiegel,Raúl Vázquez,Aman Sinha,Josef Kuchař,Philipp Mondorf
Main category: cs.CL
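TL;DR: The paper seeks to explain an apparent contradiction in how large language models (LLMs) handle numbers: their input embeddings for numbers are similar and accurate, yet their outputs are often wrong. It finds that the number representations learned by different models are systematic, highly accurate, and universal, and it builds universal probes for tracing where errors originate.
Details
Motivation: Prior work documents that LLMs produce erroneous outputs on numeric information, yet different LLMs converge to similar and accurate input embeddings for numbers. This conflict motivates a study of the mechanisms of number manipulation and of the lower bounds of their accuracy.
Contribution: 1. Shows that different LLMs learn systematic, highly accurate representations of numbers; 2. Builds universal probes for tracing the sources of errors; 3. Lays groundwork for refining LLM architectures.
Method: Analyzes LLM hidden states across types of input context to examine how systematic and universal the number representations are, and designs universal probes that localize errors to specific layers.
Result: Despite surfacing errors, LLMs' number representations are universal and systematic, and probing can trace the causes of output errors to specific layers.
Insight: More accurate probing techniques and targeted architectural refinements could further improve how LLMs handle numbers.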
Abstract: Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information – including the causes of output errors – to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs’ architectures.
[27] Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
Jingran Zhang,Ning Li,Justin Cui
Main category: cs.CL
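TL;DR: The paper evaluates OpenAI's ChatGPT Atlas agent on web games. It performs strongly on logical reasoning tasks such as Sudoku but poorly on games that require real-time interaction and precise control.
Details
Motivation: ChatGPT Atlas introduces new web interaction capabilities, but its performance in dynamic, interactive environments has been little studied. The paper aims to fill this gap.
Contribution: Provides an early evaluation of Atlas across several web games and reveals the gap between its logical reasoning ability and its real-time interaction ability.
Method: Uses browser games (including T-Rex Runner, Sudoku, Flappy Bird, and Stein.world) as test scenarios, with in-game scores as quantitative metrics.
Result: Atlas completes logic tasks such as Sudoku significantly faster than human baselines but struggles in real-time games, often failing to get past the initial obstacles.
Insight: Although Atlas shows capable analytical processing, it remains notably limited in dynamic environments that require real-time action and precise control.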
Abstract: OpenAI’s ChatGPT Atlas introduces new capabilities for web interaction, enabling the model to analyze webpages, process user intents, and execute cursor and keyboard inputs directly within the browser. While its capacity for information retrieval tasks has been demonstrated, its performance in dynamic, interactive environments remains less explored. In this study, we conduct an early evaluation of Atlas’s web interaction capabilities using browser-based games as test scenarios, including Google’s T-Rex Runner, Sudoku, Flappy Bird, and Stein.world. We employ in-game performance scores as quantitative metrics to assess performance across different task types. Our results show that Atlas performs strongly in logical reasoning tasks like Sudoku, completing puzzles significantly faster than human baselines, but struggles substantially in real-time games requiring precise timing and motor control, often failing to progress beyond initial obstacles. These findings suggest that while Atlas demonstrates capable analytical processing, there remain notable limitations in dynamic web environments requiring real-time interaction. The website of our project can be found at https://atlas-game-eval.github.io.
[28] SCRIBE: Structured Chain Reasoning for Interactive Behaviour Explanations using Tool Calling
Fares Fawzi,Vinitra Swamy,Dominik Glandorf,Tanya Nazaretsky,Tanja Käser
Main category: cs.CL
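TL;DR: SCRIBE is a framework for generating interactive behaviour explanations. It combines tool-augmented multi-hop reasoning with a self-reflective inference pipeline and targets low-resource, privacy-sensitive feedback scenarios in education.
Details
Motivation: In educational settings, language models must give personalized feedback while coping with privacy concerns, limited compute, and the need for pedagogically valid responses. SCRIBE aims to address these constraints with an efficient and reliable solution.
Contribution: 1. Proposes the SCRIBE framework, which supports multi-hop reasoning, tool use, and error recovery; 2. Distils these capabilities into small, efficient models (3B and 8B) via two-stage LoRA fine-tuning; 3. Achieves quality and usefulness comparable to much larger models.
Method: Combines domain-specific tools with a self-reflective inference pipeline, and trains small open-source models with two-stage LoRA fine-tuning on synthetic GPT-4o-generated data.
Result: 8B-SCRIBE models achieve comparable or superior quality to much larger models on dimensions such as relevance and actionability, and students rate them on par with GPT-4o and Llama-3.3 70B.
Insight: SCRIBE shows the potential of small models in educational applications: tool augmentation and efficient inference deliver strong performance while meeting privacy and resource constraints.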
Abstract: Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that 8B-SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
[29] From Amateur to Master: Infusing Knowledge into LLMs via Automated Curriculum Learning
Nishit Neema,Srinjoy Mukherjee,Sapan Shah,Gokul Ramakrishnan,Ganesh Venkatesh
Main category: cs.CL
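TL;DR: The paper proposes ACER, which automatically generates a structured curriculum (a table of contents plus question-answer pairs) for continual pretraining of LLMs, turning generalist models into domain experts while preserving their general capabilities. Experiments show clear gains in specialized domains.
Details
Motivation: Existing LLMs underperform in specialized domains such as economics and psychology. A method is needed that injects domain knowledge systematically without sacrificing general capabilities.
Contribution: Proposes the ACER framework: 1) automatically generates a structured curriculum (a table of contents plus QA pairs guided by Bloom's taxonomy); 2) designs an interleaved curriculum schedule that aligns learning across content and cognitive dimensions.
Method: 1) Synthesizes a structured curriculum (table of contents and QA pairs); 2) Applies an interleaved curriculum schedule during continual pretraining; 3) Evaluates specialized-domain performance and knowledge transfer.
Result: On specialized MMLU subsets, accuracy rises by 5 percentage points in challenging domains such as microeconomics and by 3 percentage points on macro-average across target domains. Non-target domains improve by 0.7 points, knowledge-intensive benchmarks (ARC, GPQA) improve by over 2 absolute points, and general reasoning performance stays stable.
Insight: Structured curriculum design and a progressive learning schedule can close domain knowledge gaps in LLMs while supporting knowledge transfer and avoiding catastrophic forgetting.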
Abstract: Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen) that transforms generalist models into domain experts without sacrificing their broad capabilities. ACER first synthesizes a comprehensive, textbook-style curriculum by generating a table of contents for a subject and then creating question-answer (QA) pairs guided by Bloom’s taxonomy. This ensures systematic topic coverage and progressively increasing difficulty. The resulting synthetic corpus is used for continual pretraining with an interleaved curriculum schedule, aligning learning across both content and cognitive dimensions. Experiments with Llama 3.2 (1B and 3B) show significant gains in specialized MMLU subsets. In challenging domains like microeconomics, where baselines struggle, ACER boosts accuracy by 5 percentage points. Across all target domains, we observe a consistent macro-average improvement of 3 percentage points. Notably, ACER not only prevents catastrophic forgetting but also facilitates positive cross-domain knowledge transfer, improving performance on non-target domains by 0.7 points. Beyond MMLU, ACER enhances performance on knowledge-intensive benchmarks like ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Our results demonstrate that ACER offers a scalable and effective recipe for closing critical domain gaps in LLMs.
[30] MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
Mykhailo Poliakov,Nadiya Shvai
Main category: cs.CL
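TL;DR: The paper proposes MisSynth, which uses RAG to generate synthetic fallacy data for fine-tuning LLMs and substantially improves logical fallacy classification on the MISSCI dataset.
Details
Motivation: Health-related misinformation is widespread and harmful, especially claims that distort or misinterpret scientific findings. Such fallacies are hard to identify accurately, and annotated data is limited.
Contribution: 1) Proposes the MisSynth pipeline, which uses RAG to generate synthetic data; 2) Shows that synthetic data substantially improves zero-shot classification performance; 3) Releases the code and the synthetic dataset.
Method: 1) Generates synthetic fallacy samples with RAG; 2) Fine-tunes LLMs with lightweight fine-tuning techniques; 3) Compares performance before and after fine-tuning.
Result: The fine-tuned LLaMA 3.1 8B model achieves an absolute F1-score improvement of over 35% on the MISSCI test split relative to its vanilla baseline.
Insight: Synthetic data can offset the shortage of annotated data and substantially improve LLM classification performance even with limited computational resources.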
Abstract: Health-related misinformation is very prevalent and potentially harmful. It is difficult to identify, especially when claims distort or misinterpret scientific findings. We investigate the impact of synthetic data generation and lightweight fine-tuning techniques on the ability of large language models (LLMs) to recognize fallacious arguments using the MISSCI dataset and framework. In this work, we propose MisSynth, a pipeline that applies retrieval-augmented generation (RAG) to produce synthetic fallacy samples, which are then used to fine-tune an LLM model. Our results show substantial accuracy gains with fine-tuned models compared to vanilla baselines. For instance, the LLaMA 3.1 8B fine-tuned model achieved an over 35% F1-score absolute improvement on the MISSCI test split over its vanilla baseline. We demonstrate that introducing synthetic fallacy data to augment limited annotated resources can significantly enhance zero-shot LLM classification performance on real-world scientific misinformation tasks, even with limited computational resources. The code and synthetic dataset are available on https://github.com/mxpoliakov/MisSynth.
[31] The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya,Yuichi Kitagawa
Main category: cs.CL
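TL;DR: The paper proposes an unsupervised, interaction-based method that builds a "language model graph" from conversations and applies community detection to identify synergistic multi-agent teams automatically, without prior knowledge of the models.
Details
Motivation: Multi-agent collaboration based on large language models (LLMs) is promising, but forming synergistic teams remains a major challenge because the models' internal characteristics are usually opaque.
Contribution: Proposes an interaction-centric framework that needs no prior knowledge: it builds a language model graph from the semantics of conversations and uses community detection to discover synergistic model clusters.
Method: Builds a language model graph from the semantic coherence of pairwise conversations and applies community detection to identify functionally coherent model groups.
Result: The method discovers functionally coherent groups that reflect the models' latent specializations. The resulting teams outperform random baselines on downstream tasks and perform comparably to manually curated teams.
Insight: The work offers a new basis for automatically designing collaborative multi-agent teams and highlights the role of conversational semantic coherence in model synergy.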
Abstract: While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that does not require any prior knowledge including their internal architectures, training data, or task performances. Our method constructs a “language model graph” that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams which outperform random baselines on downstream benchmarks and achieve comparable accuracy to that of manually-curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
[32] InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Kun Luo,Hongjin Qian,Zheng Liu,Ziyi Xia,Shitao Xiao,Siqi Bao,Jun Zhao,Kang Liu
Main category: cs.CL
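TL;DR: The paper proposes InfoFlow, a framework that addresses sparse rewards in deep search by optimizing reward density, and it outperforms baseline methods.
Details
Motivation: In deep search scenarios, reinforcement learning is held back by low reward density: exploration is costly while rewards are rare. The paper aims to solve this problem.
Contribution: 1. Decomposes tasks into subproblems and assigns process rewards; 2. Uses failure-guided hints to raise the probability of success; 3. Uses a dual-agent architecture to compress exploration cost.
Method: 1. Subproblem decomposition; 2. Failure-guided hints; 3. A dual-agent architecture (a researcher agent and a refiner agent).
Result: InfoFlow significantly outperforms strong baselines on multiple agentic search benchmarks and enables lightweight LLMs to match advanced proprietary LLMs.
Insight: Optimizing reward density can markedly improve deep search efficiency, and the dual-agent architecture effectively reduces exploration cost.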
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents expend significant exploratory costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. We introduce InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher’s perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
[33] Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong,Zhiquan Tan,Kai Hu
Main category: cs.CL
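TL;DR: To reduce inference latency in large language models (LLMs), the authors propose CAST, a dynamic tree decoding method that accounts for system variables such as GPU configuration and batch size. It runs up to 5.2 times faster than conventional decoding and generally outperforms existing methods by 5% to 20%.
Details
Motivation: LLMs suffer from high inference latency because of their autoregressive design and large size. Existing dynamic tree decoding methods such as EAGLE-2 and EAGLE-3 often neglect the impact of system variables like GPU devices and batch sizes, so a more efficient approach is needed.
Contribution: Proposes CAST, a dynamic tree decoding method that factors inference cost (such as GPU configuration and batch size) into the tree structure, something prior dynamic-tree methods often neglect, and thereby improves LLM inference efficiency.
Method: CAST dynamically refines the tree structure to suit the system variables at hand (GPU device and batch size). It is evaluated on six tasks with six LLMs.
Result: CAST is up to 5.2 times faster than conventional decoding and generally outperforms existing state-of-the-art methods by 5% to 20% across the six tasks.
Insight: Optimizing the dynamic tree structure should take real system variables (such as GPU configuration and batch size) into account, which is important for efficient LLM inference.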
Abstract: Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding has emerged as a solution, enabling the simultaneous generation and validation of multiple tokens. While recent approaches like EAGLE-2 and EAGLE-3 improve speculative decoding using dynamic tree structures, they often neglect the impact of crucial system variables such as GPU devices and batch sizes. Therefore, we introduce a new dynamic tree decoding approach called CAST that takes into account inference costs, including factors such as GPU configurations and batch sizes, to dynamically refine the tree structure. Through comprehensive experimentation across six diverse tasks and six distinct LLMs, our methodology achieves speeds up to 5.2 times faster than conventional decoding methods. Moreover, it generally outperforms existing state-of-the-art techniques by 5% to 20%.
[34] SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Yiqiao Jin,Rachneet Kaur,Zhen Zeng,Sumitra Ganesh,Srijan Kumar
Main category: cs.CL
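TL;DR: SlideAgent is a hierarchical multi-agent framework for understanding multi-page visual documents. It reasons at the global, page, and element levels and clearly outperforms existing models.
Details
Motivation: Current LLMs struggle with multi-page visual documents, particularly with fine-grained reasoning and with integrating information across pages. A more effective framework is therefore needed.
Contribution: SlideAgent proposes a hierarchical agentic framework that reasons at the global, page, and element levels to build a structured representation integrating multi-modal, multi-page, and multi-layout information.
Method: SlideAgent uses specialized agents at three levels: a global agent handles the document's overall themes, a page agent handles single-page layout, and an element agent focuses on fine-grained visual or textual elements. At inference time it selectively activates agents and integrates their outputs.
Result: SlideAgent improves overall performance by +7.9 over proprietary models and by +9.8 over open-source models.
Insight: A hierarchical agent design breaks a complex task into simpler subtasks, and selective activation at inference time adapts flexibly to different queries.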
Abstract: Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While large language models (LLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent achieves significant improvement over both proprietary (+7.9 overall) and open-source models (+9.8 overall).
[35] Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Mingchen Tu,Zhiqiang Liu,Juan Li,Liangyurui Liu,Junjie Wang,Lei Liang,Wen Zhang
Main category: cs.CL
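TL;DR: Evontree uses a small set of high-quality domain ontology rules to systematically extract, validate, and enhance the knowledge inside LLMs, enabling low-resource adaptation in data-sensitive fields such as healthcare without extensive external datasets.
Details
Motivation: In data-sensitive fields such as healthcare, high-quality domain-specific training data is scarce, which limits specialized applications of LLMs. Meanwhile, domain experts have already distilled ontology rules that can be used for knowledge management.
Contribution: Proposes the Evontree framework, which uses a small set of high-quality ontology rules to extract, validate, and enhance domain knowledge within LLMs for low-resource domain adaptation.
Method: 1. Extracts a domain ontology from the raw model; 2. Detects inconsistencies using two core ontology rules; 3. Reinforces the refined knowledge via self-distilled fine-tuning.
Result: On medical QA benchmarks with Llama3-8B-Instruct and Med42-v2, Evontree consistently outperforms both the unmodified models and leading supervised baselines, with accuracy gains of up to 3.7%.
Insight: Ontology rules can efficiently guide the self-evolution of knowledge in LLMs, which suits data-scarce domains and avoids dependence on large annotated datasets.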
Abstract: Large language models (LLMs) have demonstrated exceptional capabilities across multiple domains by leveraging massive pre-training and curated fine-tuning data. However, in data-sensitive fields such as healthcare, the lack of high-quality, domain-specific training corpus hinders LLMs’ adaptation for specialized applications. Meanwhile, domain experts have distilled domain wisdom into ontology rules, which formalize relationships among concepts and ensure the integrity of knowledge management repositories. Viewing LLMs as implicit repositories of human knowledge, we propose Evontree, a novel framework that leverages a small set of high-quality ontology rules to systematically extract, validate, and enhance domain knowledge within LLMs, without requiring extensive external datasets. Specifically, Evontree extracts domain ontology from raw models, detects inconsistencies using two core ontology rules, and reinforces the refined knowledge via self-distilled fine-tuning. Extensive experiments on medical QA benchmarks with Llama3-8B-Instruct and Med42-v2 demonstrate consistent outperformance over both unmodified models and leading supervised baselines, achieving up to a 3.7% improvement in accuracy. These results confirm the effectiveness, efficiency, and robustness of our approach for low-resource domain adaptation of LLMs.
[36] Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team,Yu Zhang,Zongyu Lin,Xingcheng Yao,Jiaxi Hu,Fanqing Meng,Chengyin Liu,Xin Men,Songlin Yang,Zhiyuan Li,Wentao Li,Enzhe Lu,Weizhou Liu,Yanru Chen,Weixin Xu,Longhui Yu,Yejie Wang,Yu Fan,Longguang Zhong,Enming Yuan,Dehao Zhang,Yizhi Zhang,T. Y. Liu,Haiming Wang,Shengjun Fang,Weiran He,Shaowei Liu,Yiwei Li,Jianlin Su,Jiezhong Qiu,Bo Pang,Junjie Yan,Zhejun Jiang,Weixiao Huang,Bohong Yin,Jiacheng You,Chu Wei,Zhengtao Wang,Chao Hong,Yutian Chen,Guanduo Chen,Yucheng Wang,Huabin Zheng,Feng Wang,Yibo Liu,Mengnan Dong,Zheng Zhang,Siyuan Pan,Wenhao Wu,Yuhao Wu,Longyu Guan,Jiawen Tao,Guohong Fu,Xinran Xu,Yuzhi Wang,Guokun Lai,Yuxin Wu,Xinyu Zhou,Zhilin Yang,Yulun Du
Main category: cs.CL
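TL;DR: Kimi Linear is a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons. Its core module, KDA, uses fine-grained gating to make better use of limited finite-state RNN memory, and a specialized DPLR variant keeps computation efficient. A model with 3B activated parameters (48B total) outperforms full MLA across the evaluated tasks while substantially reducing KV cache usage and increasing decoding throughput.
Details
Motivation: Full attention, as used in standard Transformers, faces compute and memory bottlenecks in long-context and efficiency-critical settings. Kimi Linear aims to provide a high-performance, efficient alternative. Contribution: 1. Proposes Kimi Delta Attention (KDA), a linear attention module that extends Gated DeltaNet; 2. Designs a chunkwise algorithm built on a specialized variant of DPLR transition matrices for hardware efficiency; 3. Open-sources the KDA kernel and vLLM implementations, and releases pre-trained and instruction-tuned checkpoints.
Method: 1. KDA uses fine-grained gating to improve the use of RNN memory; 2. The model is a layerwise hybrid of KDA and MLA; 3. A specialized DPLR variant reduces computation compared to the general DPLR formulation.
Result: With an identical training recipe, the Kimi Linear model (3B activated, 48B total parameters) outperforms full MLA across all evaluated tasks, reduces KV cache usage by up to 75%, and reaches up to 6x decoding throughput at 1M context.
Insight: Kimi Linear shows the potential of linear attention for high-performance tasks and offers a new direction for long-context, efficiency-critical settings.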
Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios – including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
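To make the recurrence behind delta-rule linear attention more concrete, the following is a minimal NumPy sketch of a single recurrent step with a per-channel forget gate. It is not the paper's chunkwise KDA kernel: the function name `gated_delta_step`, the gate vector `alpha`, and the write strength `beta` are illustrative assumptions based on the general gated delta rule, S_t = (I - beta * k k^T) Diag(alpha) S_{t-1} + beta * k v^T.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (illustrative, not the KDA kernel).

    S     : (d_k, d_v) fixed-size memory state
    q, k  : (d_k,) query and key
    v     : (d_v,) value
    alpha : (d_k,) per-channel decay in [0, 1] (assumed form of fine-grained gating)
    beta  : scalar write strength in [0, 1]
    """
    S = alpha[:, None] * S                # decay each key channel separately
    pred = k @ S                          # value currently stored under key k
    S = S + beta * np.outer(k, v - pred)  # delta rule: correct the stored value
    return S, q @ S                       # new state and the output for query q

d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
k = np.array([1.0, 0.0, 0.0, 0.0])        # unit-norm key
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([-1.0, 0.5, 4.0])
no_decay = np.ones(d_k)

S, out1 = gated_delta_step(S, k, k, v1, no_decay, beta=1.0)
S, out2 = gated_delta_step(S, k, k, v2, no_decay, beta=1.0)  # overwrites v1 under k
```

The state `S` keeps the same shape however many tokens are processed, which is the property that lets linear attention layers avoid a KV cache that grows with context length.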
[37] AMO-Bench: Large Language Models Still Struggle in High School Math Competitions
Shengnan An,Xunliang Cai,Xuezhi Cao,Xiaoyu Li,Yehao Lin,Junlin Liu,Xinxuan Lv,Dan Ma,Xuanlin Wang,Ziwen Wang,Shuang Zhou
Main category: cs.CL
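TL;DR: AMO-Bench is an advanced mathematical reasoning benchmark of 50 problems at Olympiad level or higher difficulty, built to evaluate the mathematical reasoning of large language models. Existing benchmarks are losing effectiveness because of performance saturation; AMO-Bench fills this gap with original, highly difficult problems. Experiments show that even the best-performing model reaches only 52.4% accuracy on AMO-Bench.
Details
Motivation: Existing math competition benchmarks (e.g., AIME) are no longer difficult enough to assess top-tier large language models (LLMs), so a new, more challenging benchmark is needed to drive progress in mathematical reasoning. Contribution: Proposes AMO-Bench, a benchmark of 50 original, highly difficult problems that are cross-validated by experts to meet at least International Mathematical Olympiad (IMO) difficulty standards and that require only a final answer, which enables automatic grading.
Method: Expert cross-validation and original problem design ensure high difficulty and guard against leakage from data memorization. Experiments evaluate 26 LLMs and analyze the scaling trend as test-time compute increases.
Result: The best LLM reaches 52.4% accuracy on AMO-Bench, and most LLMs score below 40%. Further analysis shows a promising scaling trend with increasing test-time compute.
Insight: Current LLMs still have significant room for improvement in mathematical reasoning, and AMO-Bench provides an effective evaluation tool for future research.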
Abstract: We present AMO-Bench, an Advanced Mathematical reasoning benchmark with Olympiad level or even higher difficulty, comprising 50 human-crafted problems. Existing benchmarks have widely leveraged high school math competitions for evaluating mathematical reasoning capabilities of large language models (LLMs). However, many existing math competitions are becoming less effective for assessing top-tier LLMs due to performance saturation (e.g., AIME24/25). To address this, AMO-Bench introduces more rigorous challenges by ensuring all 50 problems are (1) cross-validated by experts to meet at least the International Mathematical Olympiad (IMO) difficulty standards, and (2) entirely original problems to prevent potential performance leakages from data memorization. Moreover, each problem in AMO-Bench requires only a final answer rather than a proof, enabling automatic and robust grading for evaluation. Experimental results across 26 LLMs on AMO-Bench show that even the best-performing model achieves only 52.4% accuracy on AMO-Bench, with most LLMs scoring below 40%. Beyond these poor performances, our further analysis reveals a promising scaling trend with increasing test-time compute on AMO-Bench. These results highlight the significant room for improving the mathematical reasoning in current LLMs. We release AMO-Bench to facilitate further research into advancing the reasoning abilities of language models. https://amo-bench.github.io/
cs.CV [Back]
[38] CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Rishika Bhagwatkar,Syrielle Montariol,Angelika Romanou,Beatriz Borges,Irina Rish,Antoine Bosselut
Main category: cs.CV
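TL;DR: CAVE is the first benchmark of real-world visual anomalies. It supports three tasks (anomaly description, explanation, and justification) and evaluates vision-language models with fine-grained annotations. State-of-the-art models perform poorly on these tasks.
Details
Motivation: Anomaly detection in computer vision has mostly been limited to industrial defects or synthetic anomalies, which fail to capture the richness and unpredictability of real-world anomalies. CAVE aims to fill this gap. Contribution: CAVE is the first benchmark of real-world visual anomalies, offering multi-task support and fine-grained annotations for evaluating the anomaly detection and commonsense reasoning abilities of vision-language models.
Method: CAVE's design draws on cognitive science research. It supports anomaly description, explanation, and justification tasks, and adds visual grounding and multi-dimensional categorization (visual manifestation, complexity, severity, and commonness).
Result: Experiments show that state-of-the-art vision-language models struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies.
Insight: CAVE highlights the limitations of current vision-language models in understanding real-world anomalies and provides an evaluation resource and direction for future research.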
Abstract: Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
[39] Climate Adaptation-Aware Flood Prediction for Coastal Cities Using Deep Learning
Bilal Hassan,Areg Karapetyan,Aaron Chung Hin Chow,Samer Madanat
Main category: cs.CV
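TL;DR: The paper proposes a deep learning model based on a lightweight convolutional neural network (CNN) for predicting flooding in coastal cities under different sea-level rise (SLR) scenarios. The model copes with data scarcity and high-dimensional outputs and outperforms state-of-the-art methods, reducing the mean absolute error (MAE) of predicted flood depth by nearly 20% on average.
Details
Motivation: Climate change and sea-level rise pose growing threats to coastal cities. Traditional physics-based simulation is computationally expensive and impractical for city-scale flood prediction, while deep learning, though promising, is constrained by data scarcity and high-dimensional outputs. Contribution: 1. Proposes a lightweight CNN model that handles variable SLR projections and shoreline adaptation scenarios; 2. Demonstrates generalization across two geographic regions (Abu Dhabi and San Francisco); 3. Outperforms existing methods, reducing MAE by nearly 20% on average.
Method: Builds on a vision-based, low-resource deep learning framework and designs a lightweight CNN that predicts flood depth maps. Generalization is validated on datasets from multiple regions.
Result: On datasets from two distinct regions, the model reduces MAE by nearly 20% on average, supporting its practicality and scalability for coastal flood management.
Insight: The study suggests that lightweight deep learning models can address the computational and data challenges of coastal flood prediction and serve as a practical tool for climate adaptation strategies.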
Abstract: Climate change and sea-level rise (SLR) pose escalating threats to coastal cities, intensifying the need for efficient and accurate methods to predict potential flood hazards. Traditional physics-based hydrodynamic simulators, although precise, are computationally expensive and impractical for city-scale coastal planning applications. Deep Learning (DL) techniques offer promising alternatives; however, they are often constrained by challenges such as data scarcity and high-dimensional output requirements. Leveraging a recently proposed vision-based, low-resource DL framework, we develop a novel, lightweight Convolutional Neural Network (CNN)-based model designed to predict coastal flooding under variable SLR projections and shoreline adaptation scenarios. Furthermore, we demonstrate the ability of the model to generalize across diverse geographical contexts by utilizing datasets from two distinct regions: Abu Dhabi and San Francisco. Our findings demonstrate that the proposed model significantly outperforms state-of-the-art methods, reducing the mean absolute error (MAE) in predicted flood depth maps on average by nearly 20%. These results highlight the potential of our approach to serve as a scalable and practical tool for coastal flood management, empowering decision-makers to develop effective mitigation strategies in response to the growing impacts of climate change. Project Page: https://caspiannet.github.io/
[40] Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
Ali Rasekh,Erfan Bagheri Soula,Omid Daliran,Simon Gottschalk,Mohsen Fayyaz
Main category: cs.CV
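TL;DR: The paper proposes a Video-LLM architecture that introduces stacked temporal attention modules in the vision encoder, improving the understanding of temporal dynamics in video and outperforming existing models on video question answering.
Details
Motivation: Current Video-LLM architectures are limited in understanding complex temporal dynamics in video, such as action sequences and temporal progression, and need improvement. Contribution: Proposes stacked temporal attention modules embedded in the vision encoder to strengthen temporal understanding, improving task performance.
Method: Stacked temporal attention modules are introduced in the vision encoder so that the model better captures the progression of actions and the relationships between frames before visual tokens are passed to the LLM.
Result: Performance improves by up to 5.5% on benchmarks including VITATECS, MVBench, and Video-MME.
Insight: Building temporal structure directly into the vision encoder addresses a critical gap in video understanding for Video-LLMs.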
Abstract: Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large Language Model (Video-LLM) architectures have critical limitations in temporal understanding, struggling with tasks that require detailed comprehension of action sequences and temporal progression. In this work, we propose a Video-LLM architecture that introduces stacked temporal attention modules directly within the vision encoder. This design incorporates temporal attention in the vision encoder, enabling the model to better capture the progression of actions and the relationships between frames before passing visual tokens to the LLM. Our results show that this approach significantly improves temporal reasoning and outperforms existing models in video question answering tasks, specifically in action recognition. We improve on benchmarks including VITATECS, MVBench, and Video-MME by up to +5.5%. By enhancing the vision encoder with temporal structure, we address a critical gap in video understanding for Video-LLMs. Project page and code are available at: https://alirasekh.github.io/STAVEQ2/.
[41] FlexICL: A Flexible Visual In-context Learning Framework for Elbow and Wrist Ultrasound Segmentation
Yuyue Zhou,Jessica Knight,Shrimanti Ghosh,Banafshe Felfeliyan,Jacob L. Jaremko,Abhilash R. Hareendranathan
Main category: cs.CV
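TL;DR: FlexICL is a flexible visual in-context learning framework for elbow and wrist ultrasound segmentation that improves performance when labeled data is limited.
Details
Motivation: Elbow and wrist fractures are very common in pediatric populations, and automatic segmentation of ultrasound images can improve diagnostic accuracy and treatment planning. However, pixel-wise expert annotation is time-consuming and costly. Contribution: Proposes the FlexICL framework, which uses novel image concatenation methods and multiple augmentation strategies to achieve robust ultrasound segmentation with only 5% of the training images.
Method: Applies visual in-context learning (ICL) to an intra-video segmentation setting, where experts annotate only a small subset of frames and the model segments unseen frames. Various image concatenation techniques and training strategies are investigated.
Result: Across four wrist and elbow ultrasound datasets, FlexICL outperforms Painter, MAE-VQGAN, and conventional segmentation models such as U-Net and TransUNet by 1-27% in Dice coefficient.
Insight: FlexICL shows that, in medical imaging settings where labeled data is scarce, in-context learning combined with data augmentation can deliver efficient and scalable segmentation.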
Abstract: Elbow and wrist fractures are the most common fractures in pediatric populations. Automatic segmentation of musculoskeletal structures in ultrasound (US) can improve diagnostic accuracy and treatment planning. Fractures appear as cortical defects but require expert interpretation. Deep learning (DL) can provide real-time feedback and highlight key structures, helping lightly trained users perform exams more confidently. However, pixel-wise expert annotations for training remain time-consuming and costly. To address this challenge, we propose FlexICL, a novel and flexible in-context learning (ICL) framework for segmenting bony regions in US images. We apply it to an intra-video segmentation setting, where experts annotate only a small subset of frames, and the model segments unseen frames. We systematically investigate various image concatenation techniques and training strategies for visual ICL and introduce novel concatenation methods that significantly enhance model performance with limited labeled data. By integrating multiple augmentation strategies, FlexICL achieves robust segmentation performance across four wrist and elbow US datasets while requiring only 5% of the training images. It outperforms state-of-the-art visual ICL models like Painter, MAE-VQGAN, and conventional segmentation models like U-Net and TransUNet by 1-27% Dice coefficient on 1,252 US sweeps. These initial results highlight the potential of FlexICL as an efficient and scalable solution for US image segmentation well suited for medical imaging use cases where labeled data is scarce.
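Since the reported gains are stated in Dice coefficient, a short sketch of how that metric is commonly computed for binary masks may help. This is a generic NumPy illustration, not the authors' evaluation code; the function name `dice_coefficient` and the smoothing term `eps` are assumptions.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps avoids division by zero when both masks are empty (score becomes 1.0)
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
score = dice_coefficient(pred, target)  # 2 * 2 / (3 + 3), about 0.667
```

A score of 1 means the predicted and reference masks overlap perfectly, and 0 means they do not overlap at all.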
[42] Dynamic VLM-Guided Negative Prompting for Diffusion Models
Hoyeon Chang,Seungjin Kim,Yoonseok Choi
Main category: cs.CV
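TL;DR: The paper proposes a dynamic negative prompting method for diffusion models based on vision-language models (VLMs). It adaptively generates negative prompts during the denoising process, and the evaluation examines the trade-off between negative guidance strength and text-image alignment.
Details
Motivation: Traditional negative prompting uses fixed negative prompts, which limits flexibility and adaptation to context. The paper addresses this with dynamically generated negative prompts. Contribution: A VLM-based dynamic negative prompting method that produces contextually appropriate negative prompts during denoising.
Method: At specific denoising steps, the method generates intermediate image predictions and queries a VLM to produce contextually appropriate negative prompts.
Result: Evaluation on various benchmark datasets demonstrates the trade-offs between negative guidance strength and text-image alignment.
Insight: Dynamic negative prompts can adapt to the changing context of the generation process, which may make diffusion-model guidance more flexible than fixed negative prompts.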
Abstract: We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional Negative Prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
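The loop described in the abstract can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: it assumes the common classifier-free guidance form in which the negative-prompt prediction takes the place of the unconditional branch, and every callable (`denoise`, `predict_x0`, `query_vlm`, `scheduler_step`) is a hypothetical stand-in for a real diffusion model, VLM, and sampler.

```python
import numpy as np

def guided_noise(eps_pos, eps_neg, scale):
    """Classifier-free guidance with a negative prompt replacing the unconditional branch."""
    return eps_neg + scale * (eps_pos - eps_neg)

def dynamic_negative_sampling(x, prompt, denoise, predict_x0, query_vlm,
                              scheduler_step, steps, vlm_steps, scale=7.5):
    """Refresh the negative prompt from a VLM at selected denoising steps.

    All callables are hypothetical placeholders supplied by the caller.
    """
    negative = ""  # start with an empty negative prompt
    for t in steps:
        if t in vlm_steps:
            # Ask the VLM what to steer away from, given the current image estimate.
            negative = query_vlm(predict_x0(x, t), prompt)
        eps = guided_noise(denoise(x, t, prompt), denoise(x, t, negative), scale)
        x = scheduler_step(x, eps, t)
    return x, negative

# Toy stand-ins so the sketch runs end to end; they do not model a real diffusion process.
vlm_calls = []
def toy_vlm(x0, prompt):
    vlm_calls.append(prompt)
    return "blurry"

x_final, final_negative = dynamic_negative_sampling(
    x=np.zeros(2),
    prompt="a cat",
    denoise=lambda x, t, text: np.full_like(x, float(len(text))),
    predict_x0=lambda x, t: x,
    query_vlm=toy_vlm,
    scheduler_step=lambda x, eps, t: x - 0.1 * eps,
    steps=[3, 2, 1, 0],
    vlm_steps={2},
)
```

Restricting VLM queries to a few steps (`vlm_steps`) reflects the abstract's "specific denoising steps"; which steps to choose and how strongly to guide are exactly the trade-offs the paper evaluates.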
[43] Security Risk of Misalignment between Text and Image in Multi-modal Model
Xiaosen Wang,Zhijin Ge,Shaokang Wang
Main category: cs.CV
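TL;DR: The paper reveals that text and image modalities are inadequately aligned in multi-modal diffusion models and proposes PReMA, an attack that manipulates generated content by modifying only the input image, not the prompt.
Details
Motivation: Despite notable progress in multi-modal diffusion models such as text-to-image models, their vulnerability to adversarial inputs remains underexplored. The study finds that text-image alignment in existing models is inadequate, which can lead to the generation of inappropriate content. Contribution: Proposes PReMA, the first attack that manipulates model outputs solely through adversarial images rather than prompts, revealing a new security risk for multi-modal diffusion models.
Method: The input image is modified and combined with a fixed, unaltered prompt to produce adversarial images that manipulate the model's output, for example in image inpainting and style transfer.
Result: Evaluations on image inpainting and style transfer tasks across various models confirm PReMA's effectiveness and demonstrate the threat it poses to multi-modal diffusion models.
Insight: Misalignment between text and image can become a significant security vulnerability in multi-modal models, especially in applications that operate with fixed prompts.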
Abstract: Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs by solely creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.
[44] EgoExo-Con: Exploring View-Invariant Video Temporal Understanding
Minjoon Jung,Junbin Xiao,Junghyun Kim,Byoung-Tak Zhang,Angela Yao
Main category: cs.CV
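TL;DR: The paper studies the consistency of video language models (Video-LLMs) across viewpoints. It introduces the EgoExo-Con benchmark, shows that models lack view consistency, and proposes View-GRPO to improve cross-view temporal understanding.
Details
Motivation: Current Video-LLMs behave inconsistently in temporal understanding of multi-view video. The authors ask whether models can maintain consistent temporal understanding when the same event is captured from different viewpoints. Contribution: 1) Introduces the EgoExo-Con benchmark, consisting of synchronized egocentric and exocentric video pairs with natural-language queries; 2) Reveals the shortcomings of existing Video-LLMs in cross-view consistency; 3) Proposes View-GRPO, which strengthens view-specific temporal reasoning and improves cross-view consistency.
Method: 1) Build the EgoExo-Con benchmark; 2) Analyze the consistency of existing models across viewpoints; 3) Design the View-GRPO framework, which uses reinforcement learning to optimize cross-view temporal reasoning.
Result: Experiments show that existing models are poorly consistent across viewpoints. View-GRPO outperforms naive SFT and GRPO, particularly in cross-view consistency.
Insight: Multi-view temporal understanding calls for more than naive fine-tuning, which does not resolve the consistency problem; reinforcement learning is a promising remedy.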
Abstract: Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
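The distinction between correctness and consistency can be illustrated with a small sketch. The benchmark's exact metric is not specified in the abstract, so this assumes one plausible definition: a query counts as consistent only when the model answers it correctly for both the egocentric and the exocentric video. The function name `view_consistency` and the result keys are hypothetical.

```python
def view_consistency(ego_correct, exo_correct):
    """Per-view accuracy plus the share of queries answered correctly in both views."""
    if len(ego_correct) != len(exo_correct) or not ego_correct:
        raise ValueError("need two non-empty result lists of equal length")
    n = len(ego_correct)
    both = sum(1 for e, x in zip(ego_correct, exo_correct) if e and x)
    return {
        "ego_acc": sum(ego_correct) / n,
        "exo_acc": sum(exo_correct) / n,
        "consistent_acc": both / n,  # assumed definition: correct in both views
    }

# Per-query correctness for the same four queries under each viewpoint.
ego = [True, True, False, True]
exo = [True, False, True, True]
scores = view_consistency(ego, exo)
# ego_acc = 0.75, exo_acc = 0.75, consistent_acc = 0.5
```

In this toy case each single view scores 75%, yet only half of the queries are answered correctly from both viewpoints, which mirrors the abstract's observation that consistency results fall well below single-view performance.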
[45] OracleAgent: A Multimodal Reasoning Agent for Oracle Bone Script Research
Caoshuo Li,Zengmao Ding,Xiaobin Hu,Bang Li,Donghao Luo,Xu Peng,Taisong Jin,Yongge Liu,Shengwei Han,Jing Yang,Xiaoping He,Feng Gao,AndyPian Wu,SevenShu,Chaoyang Wang,Chengjie Wang
Main category: cs.CV
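TL;DR: OracleAgent is the first multimodal reasoning system designed for Oracle Bone Script research. It combines large language models with an expert knowledge base to improve the efficiency and accuracy of Oracle Bone Script research.
Details
Motivation: Oracle Bone Script research suffers from inefficient information organization and retrieval, and the interpretation workflow is complex. OracleAgent is proposed to address these problems. Contribution: 1. The first agent system for the structured management and retrieval of Oracle Bone Script information; 2. A comprehensive domain-specific multimodal knowledge base; 3. Performance on multimodal reasoning tasks that surpasses mainstream MLLMs.
Method: Integrates multiple Oracle Bone Script analysis tools with large language models and builds a knowledge base of over 1.4 million single-character rubbing images and 80,000 interpretation texts.
Result: OracleAgent achieves superior performance on multimodal reasoning and generation tasks and significantly reduces the time cost of research.
Insight: An agent system that combines an expert knowledge base with multimodal tools can effectively improve research efficiency in niche domains, and automated interpretation systems have substantial practical value.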
Abstract: As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent seamlessly integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, which is built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieval tasks of character, document, interpretation text, and rubbing image. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading mainstream multimodal large language models (MLLMs) (e.g., GPT-4o). Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of OBS-assisted research and automated interpretation systems.
[46] JOGS: Joint Optimization of Pose Estimation and 3D Gaussian Splatting
Yuxuan Li,Tao Wang,Xianben Yang
Main category: cs.CV
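TL;DR: JOGS proposes a unified framework that jointly optimizes 3D Gaussian points and camera poses, avoiding the computational bottlenecks and error propagation of traditional methods that depend on external tools such as COLMAP.
Details
Motivation: Traditional novel view synthesis methods rely on external camera pose estimation tools such as COLMAP, which introduce computational bottlenecks and propagate errors. JOGS addresses these problems through joint optimization. Contribution: 1. A unified framework that jointly optimizes 3D Gaussian points and camera poses without pre-calibrated inputs. 2. A two-phase interleaved optimization strategy that improves both reconstruction quality and pose accuracy.
Method: 1. With poses fixed, update the 3D Gaussian parameters via differentiable rendering. 2. Refine the camera poses with a customized 3D optical flow algorithm that incorporates geometric and photometric constraints.
Result: On multiple datasets, JOGS significantly outperforms existing COLMAP-free methods in reconstruction quality and in general also surpasses the standard COLMAP-based baseline.
Insight: Jointly optimizing 3D reconstruction and camera poses progressively reduces projection errors and is particularly effective in scenes with large viewpoint variations or sparse features.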
Abstract: Traditional novel view synthesis methods heavily rely on external camera pose estimation tools such as COLMAP, which often introduce computational bottlenecks and propagate errors. To address these challenges, we propose a unified framework that jointly optimizes 3D Gaussian points and camera poses without requiring pre-calibrated inputs. Our approach iteratively refines 3D Gaussian parameters and updates camera poses through a novel co-optimization strategy, ensuring simultaneous improvements in scene reconstruction fidelity and pose accuracy. The key innovation lies in decoupling the joint optimization into two interleaved phases: first, updating 3D Gaussian parameters via differentiable rendering with fixed poses, and second, refining camera poses using a customized 3D optical flow algorithm that incorporates geometric and photometric constraints. This formulation progressively reduces projection errors, particularly in challenging scenarios with large viewpoint variations and sparse feature distributions, where traditional methods struggle. Extensive evaluations on multiple datasets demonstrate that our approach significantly outperforms existing COLMAP-free techniques in reconstruction quality, and also surpasses the standard COLMAP-based baseline in general.
[47] WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
Runsheng Xu,Hubert Lin,Wonseok Jeon,Hao Feng,Yuliang Zou,Liting Sun,John Gorman,Kate Tolstaya,Sarah Tang,Brandyn White,Ben Sapp,Mingxing Tan,Jyh-Jing Hwang,Drago Anguelov
Main category: cs.CV
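TL;DR: The paper introduces the WOD-E2E dataset, which focuses on evaluating end-to-end driving in long-tail scenarios and addresses gaps in existing datasets and evaluation metrics for autonomous driving research.
Details
Motivation: Existing end-to-end driving benchmarks focus mainly on nominal scenarios and cannot adequately test systems in complex long-tail scenarios, and existing evaluation metrics fail to capture the multi-modal nature of driving. Contribution: 1. Introduces the WOD-E2E dataset, curated specifically for rare long-tail driving scenarios; 2. Designs a new evaluation metric, RFS, which assesses driving performance against rater-annotated trajectory preferences.
Method: 1. Build a dataset of 4,021 driving segments covering rare scenarios (occurrence frequency below 0.03%); 2. Design the RFS metric, which evaluates performance by comparing the predicted trajectory with rater-annotated preference labels.
Result: The WOD-E2E dataset and the RFS metric address shortcomings of existing benchmarks and support research on end-to-end autonomous driving in complex scenarios.
Insight: The work underlines the importance of long-tail scenarios and, by introducing rater preference labels, provides a metric that is closer to real driving judgment.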
Abstract: Vision-based end-to-end (E2E) driving has garnered significant interest in the research community due to its scalability and synergy with multimodal large language models (MLLMs). However, current E2E driving benchmarks primarily feature nominal scenarios, failing to adequately test the true potential of these systems. Furthermore, existing open-loop evaluation metrics often fall short in capturing the multi-modal nature of driving or effectively evaluating performance in long-tail scenarios. To address these gaps, we introduce the Waymo Open Dataset for End-to-End Driving (WOD-E2E). WOD-E2E contains 4,021 driving segments (approximately 12 hours), specifically curated for challenging long-tail scenarios that are rare in daily life, with an occurrence frequency of less than 0.03%. Concretely, each segment in WOD-E2E includes the high-level routing information, ego states, and 360-degree camera views from 8 surrounding cameras. To evaluate the E2E driving performance on these long-tail situations, we propose a novel open-loop evaluation metric: Rater Feedback Score (RFS). Unlike conventional metrics that measure the distance between predicted waypoints and the logs, RFS measures how closely the predicted trajectory matches rater-annotated trajectory preference labels. We have released rater preference labels for all WOD-E2E validation set segments, while the held-out test set labels have been used for the 2025 WOD-E2E Challenge. Through our work, we aim to foster state-of-the-art research into generalizable, robust, and safe end-to-end autonomous driving agents capable of handling complex real-world situations.
[48] Exploring Object-Aware Attention Guided Frame Association for RGB-D SLAM
Ali Caglayan,Nevrez Imamoglu,Oguzhan Guclu,Ali Osman Serhatoglu,Ahmet Burak Can,Ryosuke Nakamura
Main category: cs.CV
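TL;DR: The paper proposes combining gradient-based attention information with CNN features to improve RGB-D SLAM. Experiments indicate improved performance over baselines, particularly in large environments.
Details
Motivation: In RGB-D SLAM, CNN features enriched with gradient-based attention information can improve task performance, but explicit use of this combination remains limited. Contribution: Proposes a method that enhances CNN features with gradient-based attention information to improve frame association in RGB-D SLAM.
Method: Layer-wise attention information derived from network gradients is integrated with CNN feature representations, strengthening spatial attention on semantic object locations.
Result: Experiments indicate improved performance compared to baseline methods, particularly for large environments.
Insight: Gradient-based attention information can improve how CNN features perform in visual tasks such as SLAM, especially in large environments.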
Abstract: Attention models have recently emerged as a powerful approach, demonstrating significant progress in various fields. Visualization techniques, such as class activation mapping, provide visual insights into the reasoning of convolutional neural networks (CNNs). Using network gradients, it is possible to identify regions where the network pays attention during image recognition tasks. Furthermore, these gradients can be combined with CNN features to localize more generalizable, task-specific attentive (salient) regions within scenes. However, explicit use of this gradient-based attention information integrated directly into CNN representations for semantic object understanding remains limited. Such integration is particularly beneficial for visual tasks like simultaneous localization and mapping (SLAM), where CNN representations enriched with spatially attentive object locations can enhance performance. In this work, we propose utilizing task-specific network attention for RGB-D indoor SLAM. Specifically, we integrate layer-wise attention information derived from network gradients with CNN feature representations to improve frame association performance. Experimental results indicate improved performance compared to baseline methods, particularly for large environments.
[49] BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation
Wei Shang,Wanying Zhang,Shuhang Gu,Pengfei Zhu,Qinghua Hu,Dongwei Ren
Main category: cs.CV
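TL;DR: The paper proposes BasicAVSR, a method for arbitrary-scale video super-resolution (AVSR) that combines image priors with enhanced motion compensation. It significantly outperforms existing methods in super-resolution quality, generalization ability, and inference speed.
Details
Motivation: Video super-resolution at varying scaling factors faces challenges in spatial detail reproduction, temporal consistency, and computational complexity. BasicAVSR is proposed as an efficient and adaptable strong baseline. Contribution: 1) Incorporates adaptive multi-scale frequency priors; 2) Designs a flow-guided propagation unit to aggregate spatiotemporal information; 3) Introduces a second-order motion compensation unit for more accurate spatial alignment; 4) Proposes a hyper-upsampling unit that generates scale-aware and content-independent upsampling kernels. Three propagation variants are also provided to meet different application demands.
Method: BasicAVSR is built from four core components: adaptive multi-scale frequency priors generated from image Laplacian pyramids, a flow-guided propagation unit, a second-order motion compensation unit, and a hyper-upsampling unit. Three propagation variants (a unidirectional RNN, a unidirectional RNN with limited lookahead, and a bidirectional RNN) adapt the model to different scenarios.
Result: Experiments show that BasicAVSR significantly outperforms existing methods in super-resolution quality, generalization ability, and inference speed. Its core components can also be extended to multiple frameworks.
Insight: The results suggest that combining image priors with enhanced motion compensation effectively improves video super-resolution, and the propagation variants show that the design adapts to online, low-delay, and offline settings.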
Abstract: Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
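As background for the first component, the sketch below shows how a Laplacian pyramid splits an image into band-pass detail layers plus a coarse residual. It is a simplified NumPy illustration under stated assumptions, not the BasicAVSR code: it uses 2x2 average pooling and nearest-neighbor upsampling in place of the Gaussian filtering usually used, and it requires image sides divisible by 2 raised to the number of levels. All function names are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve resolution with 2x2 average pooling (stand-in for Gaussian blur + subsample)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double resolution by nearest-neighbor repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Return `levels` detail bands (fine to coarse) followed by the coarse residual."""
    pyramid = []
    current = img.astype(float)
    for _ in range(levels):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))  # detail lost by downsampling
        current = smaller
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition by upsampling and adding the bands back."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = upsample(current) + band
    return current

image = np.arange(64, dtype=float).reshape(8, 8)
pyr = laplacian_pyramid(image, levels=2)   # shapes: (8, 8), (4, 4), residual (2, 2)
restored = reconstruct(pyr)                # equals the original image
```

Each band isolates detail at one spatial frequency range, which is why such pyramids are a natural source of multi-scale frequency priors.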
[50] MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction
Shunjie-Fabian Zheng,Hyeonjun Lee,Thijs Kooi,Ali Diba
Main category: cs.CV
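TL;DR: MV-MLM combines multi-view mammography with language in a cross-modal self-supervised learning approach that improves breast cancer classification and risk prediction.
Details
Motivation: Existing breast cancer CAD models rely on large annotated datasets, which are costly and time-consuming to acquire. Vision-language models (VLMs) can improve data efficiency and robustness through cross-modal learning. Contribution: 1) Proposes MV-MLM, which performs cross-modal learning on multi-view mammograms and synthetic radiology reports; 2) Designs a joint visual-textual learning strategy; 3) Achieves state-of-the-art performance on several classification tasks.
Method: Uses cross-modal self-supervision over multi-view mammograms and pseudo-radiology reports, optimized with a joint visual-textual learning strategy.
Result: Reaches state-of-the-art performance on malignancy classification, subtype classification, and image-based cancer risk prediction while training only on synthetic text reports, without actual radiology reports.
Insight: Cross-modal learning can make effective use of synthetic text and image data, reducing the dependence on annotated data while improving performance on medical imaging tasks.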
Abstract: Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-grained annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy over different data types and tasks, to distinguish breast tissues or cancer characteristics (calcification, mass), and to use these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
[51] Detecting Unauthorized Vehicles using Deep Learning for Smart Cities: A Case Study on Bangladesh
Sudipto Das Sukanto,Diponker Roy,Fahim Shakil,Nirjhar Singha,Abdullah Asik,Aniket Joarder,Mridha Md Nafis Fuad,Muhammad Ibrahim
Main category: cs.CV
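TL;DR: The paper proposes a deep learning approach that uses the YOLOv8 model to detect auto-rickshaws in Bangladeshi cities in real time, addressing the difficulty existing surveillance systems have in telling auto-rickshaws from non-auto rickshaws.
Details
Motivation: In South Asian countries such as Bangladesh, auto-rickshaws are barred from certain routes, but because they look similar to non-auto rickshaws, existing surveillance systems struggle to distinguish them, and manual video analysis is too time-consuming. An automated solution is therefore needed. Contribution: 1) A YOLOv8-based real-time object detection approach; 2) A publicly released dataset of 1,730 annotated images; 3) Validation of the model in both dense and sparse traffic scenarios.
Method: Uses the YOLOv8 model for real-time object detection, trained on a self-collected dataset of 1,730 annotated images captured under various traffic conditions. Reported metrics are mAP50, precision, and recall.
Result: The model reaches an mAP50 of 83.447% with binary precision and recall both above 78%, and is effective in both dense and sparse traffic scenarios.
Insight: Deep learning is promising for large-scale traffic monitoring, especially when visually similar vehicles must be told apart. The public dataset supports further research.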
Abstract: Modes of transportation vary across countries depending on geographical location and cultural context. In South Asian countries, rickshaws are among the most common means of local transport. Based on their mode of operation, rickshaws in cities across Bangladesh can be broadly classified into non-auto (pedal-powered) and auto-rickshaws (motorized). Monitoring the movement of auto-rickshaws is necessary as traffic rules often restrict auto-rickshaws from accessing certain routes. However, existing surveillance systems make it quite difficult to monitor them due to their similarity to other vehicles, especially non-auto rickshaws, whereas manual video analysis is too time-consuming. This paper presents a machine learning-based approach to automatically detect auto-rickshaws in traffic images. The system performs real-time object detection with the YOLOv8 model. For training purposes, we prepared a set of 1,730 annotated images that were captured under various traffic conditions. The results show that our proposed model performs well in real-time auto-rickshaw detection and offers an mAP50 of 83.447% and binary precision and recall values above 78%, demonstrating its effectiveness in handling both dense and sparse traffic scenarios. The dataset has been publicly released for further research.
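The "50" in mAP50 refers to an intersection-over-union (IoU) threshold of 0.5 for counting a predicted box as a match. The sketch below illustrates IoU and precision/recall at that threshold for a single image and a single class. It is a generic illustration, not the paper's or YOLOv8's evaluation code; the function names, the `(x1, y1, x2, y2)` box format, and the greedy matching are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(preds, truths, threshold=0.5):
    """Greedily match predictions (assumed sorted by confidence) to ground-truth boxes."""
    matched = set()
    true_pos = 0
    for p in preds:
        best_iou, best_idx = 0.0, None
        for idx, g in enumerate(truths):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(p, g)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(truths) if truths else 0.0
    return precision, recall

truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0, 0, 10, 8), (50, 50, 60, 60)]   # one good match, one false positive
overlap = iou(preds[0], truths[0])           # 80 / 100 = 0.8
precision, recall = precision_recall(preds, truths)  # (0.5, 0.5)
```

A full mAP50 computation additionally sweeps the confidence threshold to build a precision-recall curve, takes the area under it per class, and averages over classes.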
[52] CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Jiaqi Wang,Xiao Yang,Kai Sun,Parth Suresh,Sanat Sharma,Adam Czyzewski,Derek Andersen,Surya Appini,Arkav Banerjee,Sajal Choudhary,Shervin Ghasemlou,Ziqiang Guan,Akil Iyer,Haidar Khan,Lingkun Kong,Roy Luo,Tiffany Ma,Zhen Qiao,David Tran,Wenfang Xu,Skyler Yeatman,Chen Zhou,Gunveer Gujral,Yinglong Xia,Shane Moon,Nicolas Scheffer,Nirav Shah,Eun Chang,Yue Liu,Florian Metze,Tammy Stark,Zhaleh Feizollahi,Andrea Jessee,Mangesh Pujari,Ahmed Aly,Babak Damavandi,Rakesh Wanga,Anuj Kumar,Rohit Patel,Wen-tau Yih,Xin Luna Dong
Main category: cs.CV
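TL;DR: CRAG-MM is a comprehensive RAG benchmark for multi-modal, multi-turn conversations that fills a gap for wearable-device scenarios. It contains 6.5K (image, question, answer) triplets and 2K multi-turn conversations across 13 domains.
Details
Motivation: Existing MM-RAG work lacks a comprehensive benchmark for wearable-device scenarios; CRAG-MM aims to fill this gap. Contribution: Introduces the CRAG-MM benchmark with a diverse dataset (6.5K triplets and 2K multi-turn conversations) that reflects real-world challenges, and designs three tasks together with their evaluation.
Method: Three tasks are defined (single-source augmentation, multi-source augmentation, and multi-turn conversations), each paired with a retrieval corpus and APIs for image-KG retrieval and webpage retrieval; existing RAG approaches are then evaluated on them.
Result: Straightforward RAG baselines reach only 32% and 43% truthfulness on single- and multi-turn QA, respectively; state-of-the-art industry solutions are similar (32%/45%); and the winning KDD Cup 2025 solutions improve baseline performance by 28%.
Insight: CRAG-MM advances MM-RAG research and exposes the shortcomings of current methods and the room left for improvement.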
Abstract: Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM – a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations – each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
[53] MoTDiff: High-resolution Motion Trajectory estimation from a single blurred image using Diffusion models
Wontae Choi,Jaelin Lee,Hyung Sup Yun,Byeungwoo Jeon,Il Yong Chun
Main category: cs.CV
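TL;DR: MoTDiff is a diffusion-based method for high-resolution motion trajectory estimation that recovers a high-quality motion trajectory from a single blurred image and outperforms existing methods.
Details
Motivation: The motion information that existing methods extract from a single blurred image is usually coarse and inaccurate, which limits its use in computational imaging and computer vision. Contribution: Proposes the first diffusion-based framework for high-resolution motion trajectory estimation (MoTDiff), which estimates a high-quality motion trajectory from a single blurred image.
Method: 1) A conditional diffusion framework conditioned on multi-scale feature maps; 2) a new training method that promotes precise identification of a fine-grained trajectory, consistent overall shape and position, and pixel connectivity along the trajectory.
Result: MoTDiff outperforms state-of-the-art methods in both blind image deblurring and coded exposure photography.
Insight: Diffusion models can be applied effectively to high-resolution motion trajectory estimation and may extend to other motion-related computer vision tasks.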
Abstract: Accurate estimation of motion information is crucial in diverse computational imaging and computer vision applications. Researchers have investigated various methods to extract motion information from a single blurred image, including blur kernels and optical flow. However, existing motion representations are often of low quality, i.e., coarse-grained and inaccurate. In this paper, we propose the first high-resolution (HR) Motion Trajectory estimation framework using Diffusion models (MoTDiff). Unlike existing motion representations, we aim to estimate a high-quality HR motion trajectory from a single motion-blurred image. The proposed MoTDiff consists of two key components: 1) a new conditional diffusion framework that uses multi-scale feature maps extracted from a single blurred image as a condition, and 2) a new training method that promotes precise identification of a fine-grained motion trajectory, consistent estimation of the overall shape and position of a motion path, and pixel connectivity along a motion trajectory. Our experiments demonstrate that the proposed MoTDiff outperforms state-of-the-art methods in both blind image deblurring and coded exposure photography applications.
[54] ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts
Jinho Choi,Hyesu Lim,Steffen Schneider,Jaegul Choo
Main category: cs.CV
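TL;DR: ConceptScope is an automated framework that analyzes visual datasets with Sparse Autoencoders, discovering and quantifying human-interpretable concepts in order to detect dataset bias.
Details
Motivation: Dataset bias is ubiquitous in machine learning, but identifying it systematically usually requires costly fine-grained annotations. ConceptScope aims to discover dataset bias automatically, without annotations. Contribution: Proposes the ConceptScope framework, which disentangles visual concepts and categorizes them (target, context, and bias) to identify and quantify dataset bias.
Method: Concepts are extracted with Sparse Autoencoders trained on representations from vision foundation models and categorized by semantic relevance and statistical correlation to class labels. Spatial attribution maps derived from concept activations are used to validate the semantic alignment of the concepts.
Result: ConceptScope captures a wide range of visual concepts (such as objects, textures, and emotions) and detects both known and previously unannotated biases.
Insight: ConceptScope offers a practical tool for dataset auditing and model diagnostics, especially for bias analysis on unannotated data.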
Abstract: Dataset bias, where data points are skewed to certain concepts, is ubiquitous in machine learning datasets. Yet, systematically identifying these biases is challenging without costly, fine-grained attribute annotations. We present ConceptScope, a scalable and automated framework for analyzing visual datasets by discovering and quantifying human-interpretable concepts using Sparse Autoencoders trained on representations from vision foundation models. ConceptScope categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels, enabling class-level dataset characterization, bias identification, and robustness evaluation through concept-based subgrouping. We validate that ConceptScope captures a wide range of visual concepts, including objects, textures, backgrounds, facial attributes, emotions, and actions, through comparisons with annotated datasets. Furthermore, we show that concept activations produce spatial attributions that align with semantically meaningful image regions. ConceptScope reliably detects known biases (e.g., background bias in Waterbirds) and uncovers previously unannotated ones (e.g., co-occurring objects in ImageNet), offering a practical tool for dataset auditing and model diagnostics.
[55] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta,Lis Kanashiro Pereira,Peitao Han,Fei Cheng,Shigeru Kitazawa
Main category: cs.CV
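TL;DR: This paper introduces a new benchmark, AoT-PsyPhyBENCH, that evaluates whether vision-language models (VLMs) can judge the temporal direction of a video (forward or backward). The results show that current VLMs have clear weaknesses in temporal continuity and causal understanding.
Details
Motivation: Modern VLMs perform well on multimodal tasks, but their understanding of temporal information in video remains weak and under-evaluated, particularly their ability to judge the direction of time. Contribution: Designs a psychophysics-grounded benchmark, AoT-PsyPhyBENCH, that evaluates VLMs on temporal direction judgment and reveals the limitations of existing models.
Method: A range of VLMs are evaluated on temporal direction judgment using human-validated video stimuli and behavioral baselines.
Result: Most VLMs perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes and causal actions that humans recognize easily.
Insight: Current multimodal systems capture rich visual-semantic correlations but lack the inductive biases needed for temporal continuity and causal understanding.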
Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
[56] A Hybrid Framework Bridging CNN and ViT based on Theory of Evidence for Diabetic Retinopathy Grading
Junlai Qiu,Yunzhu Chen,Hao Zheng,Yawen Huang,Yuexiang Li
Main category: cs.CV
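TL;DR: Proposes a hybrid framework based on the theory of evidence that combines the strengths of CNN and ViT to improve diabetic retinopathy grading.
Details
Motivation: Diabetic retinopathy is a leading cause of vision loss among middle-aged and elderly people. Existing methods built on a single CNN or ViT backbone have hit a performance bottleneck, so combining the strengths of both is needed to improve diagnosis. Contribution: 1. A novel evidential fusion paradigm that turns features from different backbones into supporting evidence via deep evidential networks; 2. adaptive tuning of the fusion pattern between CNN and ViT to boost performance; 3. interpretability for feature fusion and decision-making.
Method: Deep evidential networks convert CNN and ViT features into evidence, from which an aggregated opinion is formed and used to adaptively tune the fusion pattern. Experiments are run on two public datasets.
Result: The hybrid model is more accurate than existing methods while offering good interpretability for feature fusion and decision-making.
Insight: Combining the local feature extraction of CNN with the global feature capture of ViT, fused dynamically through the theory of evidence, can break the bottleneck of single-backbone models while improving interpretability.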
Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss among middle-aged and elderly people, which significantly impacts their daily lives and mental health. To improve the efficiency of clinical screening and enable the early detection of DR, a variety of automated DR diagnosis systems have recently been built on convolutional neural networks (CNNs) or vision Transformers (ViTs). However, due to the inherent shortcomings of CNNs and ViTs, the performance of existing methods that use a single type of backbone has reached a bottleneck. One potential route to further improvement is to integrate different kinds of backbones, which can fully leverage their respective strengths (i.e., the local feature extraction capability of CNN and the global feature capturing ability of ViT). To this end, we propose a novel paradigm to effectively fuse the features extracted by different backbones based on the theory of evidence. Specifically, the proposed evidential fusion paradigm transforms the features from different backbones into supporting evidence via a set of deep evidential networks. With the supporting evidence, an aggregated opinion can be formed, which is used to adaptively tune the fusion pattern between the backbones and thereby boost the performance of our hybrid model. We evaluated our method on two publicly available DR grading datasets. The experimental results demonstrate that our hybrid model not only improves the accuracy of DR grading compared to state-of-the-art frameworks, but also provides excellent interpretability for feature fusion and decision-making.
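The abstract of entry [56] does not state how the aggregated opinion is formed. As a hedged sketch only, the code below assumes the subjective-logic formulation common in evidential deep learning: non-negative evidence per class defines a Dirichlet distribution, which yields per-class belief masses plus an uncertainty mass, and two opinions are merged with the reduced Dempster combination rule. The function names and the example evidence vectors are hypothetical.

```python
def opinion(evidence):
    """Turn non-negative per-class evidence into (beliefs, uncertainty).

    Assumes Dirichlet parameters alpha_k = e_k + 1, so the Dirichlet
    strength is S = sum(e_k) + K, belief b_k = e_k / S, and u = K / S.
    """
    k = len(evidence)
    strength = sum(evidence) + k
    beliefs = [e / strength for e in evidence]
    uncertainty = k / strength
    return beliefs, uncertainty


def fuse(op_a, op_b):
    """Combine two opinions with the reduced Dempster rule (an assumption)."""
    (b1, u1), (b2, u2) = op_a, op_b
    # Conflict: belief mass the two sources assign to different classes.
    conflict = sum(
        b1[i] * b2[j]
        for i in range(len(b1))
        for j in range(len(b2))
        if i != j
    )
    scale = 1.0 - conflict
    beliefs = [(x * y + x * u2 + y * u1) / scale for x, y in zip(b1, b2)]
    uncertainty = (u1 * u2) / scale
    return beliefs, uncertainty


# Hypothetical evidence from a CNN branch and a ViT branch for 3 grades.
cnn_opinion = opinion([9.0, 1.0, 0.0])
vit_opinion = opinion([2.0, 0.0, 0.0])
fused_beliefs, fused_uncertainty = fuse(cnn_opinion, vit_opinion)
```

Under this rule the fused beliefs and uncertainty still sum to one, and a branch with high uncertainty contributes less to the result, which is one way an adaptive fusion pattern could arise.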
[57] GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?
Mingyu Sung,Seungjae Ham,Kangwoo Kim,Yeokyoung Yoon,Sangseok Yun,Il-Min Kim,Jae-Mo Kang
Main category: cs.CV
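TL;DR: GLYPH-SR is a vision-language-guided diffusion framework that aims to achieve high-quality image super-resolution and high-fidelity text recovery at the same time. By combining OCR guidance with a ping-pong scheduler, it substantially improves OCR metrics over existing methods while keeping perceptual quality competitive.
Details
Motivation: Super-resolution research has usually been tuned to distortion metrics (PSNR/SSIM) or learned perceptual metrics, which recover text embedded in complex scenes poorly. In practical applications such as document analysis and autonomous driving, this loses key textual information. A method that optimizes both text legibility and image quality is therefore needed. Contribution: 1. Proposes the GLYPH-SR framework, which jointly targets text recovery and image super-resolution; 2. designs a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data and a ping-pong scheduler that alternates between text- and scene-centric guidance; 3. trains these components on a synthetic corpus while keeping the main SR branch frozen.
Method: 1. TS-ControlNet uses OCR data to guide text restoration; 2. a ping-pong scheduler alternates between text-centric and scene-centric guidance; 3. the main SR branch stays frozen during training.
Result: On SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to 15.18 percentage points while remaining competitive on the MANIQA, CLIP-IQA, and MUSIQ perceptual metrics.
Insight: 1. Text recovery should be optimized jointly with super-resolution rather than treated as generic texture; 2. an alternating schedule can balance text and scene restoration; 3. OCR data can serve as key guidance that makes the model more sensitive to text.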
Abstract: Image super-resolution (SR) is fundamental to many vision systems, from surveillance and autonomy to document analysis and retail analytics, because recovering high-frequency details, especially scene text, enables reliable downstream perception. Scene text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion metrics (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ scores. GLYPH-SR is designed to satisfy both objectives simultaneously, high readability and high visual realism, delivering SR that looks right and reads right.
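The abstract of entry [57] says only that the ping-pong scheduler alternates between text- and scene-centric guidance. A minimal sketch of such an alternation might look as follows; the per-step granularity, the `period` parameter, and the mode labels are assumptions for illustration and are not the paper's actual schedule.

```python
def ping_pong_schedule(num_steps, period=1, start="text"):
    """Return the guidance mode to use at each denoising step.

    num_steps : number of denoising steps (hypothetical granularity)
    period    : how many consecutive steps keep the same mode
    start     : mode used for the first block, "text" or "scene"
    """
    if start not in ("text", "scene"):
        raise ValueError("start must be 'text' or 'scene'")
    if period < 1:
        raise ValueError("period must be at least 1")
    other = "scene" if start == "text" else "text"
    # Even-numbered blocks use the starting mode, odd-numbered the other.
    return [start if (t // period) % 2 == 0 else other for t in range(num_steps)]


# Example: the mode a sampler loop would pick at each of six steps.
modes = ping_pong_schedule(6, period=2)
```

A sampler would then select the text-centric or scene-centric conditioning at step `t` according to `modes[t]`.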
[58] EEG-Driven Image Reconstruction with Saliency-Guided Diffusion Models
Igor Abramov,Ilya Makarov
Main category: cs.CV
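TL;DR: This paper proposes a dual-conditioning framework that combines EEG embeddings with spatial saliency maps, addressing the fact that existing EEG-driven image reconstruction methods overlook spatial attention mechanisms.
Details
Motivation: Existing EEG-driven image reconstruction methods usually lack a spatial attention mechanism, which limits the fidelity and semantic coherence of the generated images. The authors propose a new method to address this. Contribution: 1. A dual-conditioning framework that combines EEG embeddings with saliency maps; 2. EEG feature extraction with ATM; 3. fine-tuning Stable Diffusion 2.1 via LoRA to align neural signals with visual semantics; 4. a ControlNet branch for spatial control.
Method: 1. The Adaptive Thinking Mapper (ATM) extracts EEG features; 2. Stable Diffusion 2.1 is fine-tuned with LoRA; 3. a ControlNet branch conditions generation on saliency maps; 4. the dual-conditioning framework combines EEG and saliency conditioning.
Result: Evaluated on the THINGS-EEG dataset, the method significantly improves the quality of both low- and high-level image features over existing approaches while aligning strongly with human visual attention.
Insight: Attentional priors can resolve the ambiguity of EEG signals and enable high-fidelity reconstructions, with applications in medical diagnostics and neuroadaptive interfaces.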
Abstract: Existing EEG-driven image reconstruction methods often overlook spatial attention mechanisms, limiting fidelity and semantic coherence. To address this, we propose a dual-conditioning framework that combines EEG embeddings with spatial saliency maps to enhance image generation. Our approach leverages the Adaptive Thinking Mapper (ATM) for EEG feature extraction and fine-tunes Stable Diffusion 2.1 via Low-Rank Adaptation (LoRA) to align neural signals with visual semantics, while a ControlNet branch conditions generation on saliency maps for spatial control. Evaluated on THINGS-EEG, our method achieves a significant improvement in the quality of low- and high-level image features over existing approaches while aligning strongly with human visual attention. The results demonstrate that attentional priors resolve EEG ambiguities, enabling high-fidelity reconstructions with applications in medical diagnostics and neuroadaptive interfaces, advancing neural decoding through efficient adaptation of pre-trained diffusion models.
[59] LoCoT2V-Bench: A Benchmark for Long-Form and Complex Text-to-Video Generation
Xiangqing Zheng,Chengyue Wu,Kehai Chen,Min Zhang
Main category: cs.CV
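TL;DR: LoCoT2V-Bench is a benchmark designed for long video generation (LVG) under complex input conditions, with a multi-dimensional evaluation framework and new metrics. It reveals that existing models fall short in inter-event consistency, fine-grained alignment, and high-level thematic expression.
Details
Motivation: Current text-to-video evaluation concentrates on short clips and simplified prompts, and lacks fine-grained and high-level criteria for long videos and complex prompts. Contribution: 1) Designs LoCoT2V-Bench, a benchmark for complex long video generation; 2) proposes new metrics such as event-level alignment and the Human Expectation Realization Degree (HERD); 3) comprehensively evaluates nine representative models.
Method: Complex prompts are built from real-world videos, and a multi-dimensional evaluation framework covers event-level alignment, fine-grained temporal consistency, content clarity, and high-level thematic expression (HERD).
Result: Existing models perform well on basic visual and temporal aspects but still fall short in inter-event consistency, fine-grained alignment, and high-level thematic adherence.
Insight: Future work should attend to abstract dimensions of long videos such as narrative coherence, emotional response, and character development.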
Abstract: Recently, text-to-video generation has made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially when processing complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Based on various real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts incorporating elements like scene transitions and event dynamics. Moreover, it constructs a multi-dimensional evaluation framework that includes our newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD), which focuses on more abstract attributes like narrative flow, emotional response, and character development. Using this framework, we conduct a comprehensive evaluation of nine representative LVG models, finding that while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form complex text-to-video generation and highlights critical directions for future method improvement.
[60] A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Shihab Aaqil Ahamed,Udaya S. K. P. Miriya Thanthrige,Ranga Rodrigo,Muhammad Haris Khan
Main category: cs.CV
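TL;DR: This paper proposes A-TPT, a new test-time prompt tuning framework that maximizes the minimum pairwise angular distance between normalized textual features, improving the calibration of vision-language models across multiple datasets and tasks.
Details
Motivation: Existing test-time prompt tuning (TPT) methods do not ensure enough angular diversity among textual features, which hurts calibration and reliability. This paper addresses the problem by introducing angular diversity. Contribution: 1. Proposes the A-TPT framework, which emphasizes the role of angular diversity in TPT; 2. achieves a uniform feature distribution by maximizing the minimum pairwise angular distance; 3. validates effectiveness and generalization on multiple datasets.
Method: A-TPT maximizes the minimum pairwise angular distance between normalized textual features on the unit hypersphere to make their distribution uniform. This avoids the limitations of conventional orthogonality constraints and improves calibration.
Result: A-TPT reduces the aggregate average calibration error more than existing TPT methods while keeping accuracy comparable. Its zero-shot calibration is especially strong on natural distribution shifts and medical datasets.
Insight: Angular diversity is key to the calibration and reliability of vision-language models; a uniform feature distribution markedly improves test-time adaptation.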
Abstract: Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs’ reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
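The quantity that entry [60] describes, the minimum pairwise angular distance between normalized textual features, can be computed as in the sketch below. This is only an illustration in plain Python: the function names are hypothetical, and turning the quantity into a loss by negating it is an assumption, since the abstract does not give the exact objective.

```python
import math
from itertools import combinations


def normalize(vec):
    """Project a feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def min_pairwise_angle(features):
    """Smallest angle (in radians) between any two normalized features."""
    if len(features) < 2:
        raise ValueError("need at least two feature vectors")
    unit = [normalize(f) for f in features]
    smallest = math.pi
    for a, b in combinations(unit, 2):
        # Clamp to [-1, 1] to guard against floating-point drift.
        cosine = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
        smallest = min(smallest, math.acos(cosine))
    return smallest


def angular_diversity_loss(features):
    """Assumed loss form: minimizing it maximizes the minimum angle."""
    return -min_pairwise_angle(features)
```

Because only the closest pair determines the value, raising it pushes the two most similar class features apart first, which is what distinguishes this criterion from maximizing average dispersion.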
[61] PointSt3R: Point Tracking through 3D Grounded Correspondence
Rhodri Guerrier,Adam W. Harley,Dima Damen
Main category: cs.CV
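TL;DR: The paper proposes PointSt3R, which performs point tracking through 3D grounded correspondence. Combining a reconstruction loss with dynamic-correspondence training, it achieves competitive or superior point tracking results on several datasets.
Details
Motivation: Foundational 3D reconstruction models such as DUSt3R and MASt3R perform well on 2D and 3D correspondence in static scenes, but how to extend them to dynamic point tracking is still open. Contribution: 1) Shows that these models are competitive point trackers on static points; 2) proposes combining the reconstruction loss with training for dynamic correspondence; 3) fine-tunes MASt3R on a relatively small amount of synthetic data, improving dynamic point tracking.
Method: 1) Training with a reconstruction loss plus dynamic correspondence; 2) adding a visibility head; 3) training and evaluating on pairs of frames only, which removes temporal context; 4) mixing dynamic and static point correspondences.
Result: Competitive or superior results on four datasets: competitive on TAP-Vid-DAVIS (73.8 δ_avg and 85.8% occlusion accuracy, versus 75.7 and 88.3% for CoTracker2), and significantly better than CoTracker3 on EgoPoints (61.3 vs 54.2) and RGB-S (87.0 vs 82.8).
Insight: 1) Dynamic-correspondence training improves point tracking; 2) fine-tuning on a small amount of synthetic data is sufficient; 3) 3D grounded correspondence shows promise for dynamic tasks.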
Abstract: Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks (+33.5% on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and to fine-tune MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{avg}$ / 85.8% occlusion acc. for PointSt3R compared to 75.7 / 88.3% for CoTracker2; and significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.
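For readers unfamiliar with the $\delta_{avg}$ figure quoted in entry [61], the sketch below follows the position-accuracy metric as commonly defined for the TAP-Vid benchmarks: for each pixel threshold in {1, 2, 4, 8, 16}, take the fraction of visible points whose predicted position lies within that distance of the ground truth, then average the five fractions. The function name and input layout are assumptions for this example, and benchmark details such as the evaluation resolution are left out.

```python
import math

THRESHOLDS = (1, 2, 4, 8, 16)  # pixel thresholds assumed from TAP-Vid


def delta_avg(pred, gt, visible, thresholds=THRESHOLDS):
    """Average position accuracy over several pixel thresholds.

    pred, gt : lists of (x, y) positions, one per tracked point and frame
    visible  : list of booleans; occluded points are excluded
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    errors = [math.dist(p, g) for p, g, v in zip(pred, gt, visible) if v]
    if not errors:
        raise ValueError("no visible points to evaluate")
    fractions = [
        sum(err < thr for err in errors) / len(errors) for thr in thresholds
    ]
    return sum(fractions) / len(thresholds)
```

For example, with two visible points whose errors are 0 and 3 pixels, the per-threshold fractions are 0.5, 0.5, 1, 1, 1, so the sketch returns 0.8.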
[62] Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Yuanting Fan,Jun Liu,Xiaochen Chen,Bin-Bin Gao,Jian Li,Yong Liu,Jinlong Peng,Chengjie Wang
Main category: cs.CV
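TL;DR: The paper proposes FineGrainedAD, a new framework that uses the Multi-Level Fine-Grained Semantic Caption (MFSC) and Multi-Level Semantic Alignment (MLSA) to address insufficient vision-language semantic alignment in few-shot anomaly detection, improving anomaly localization.
Details
Motivation: Existing few-shot anomaly detection (FSAD) methods rely on pre-trained vision-language models (VLMs), but for lack of fine-grained textual descriptions they can only match visual patch tokens against image-level descriptions, which causes semantic misalignment and limits performance. Contribution: 1. Proposes the Multi-Level Fine-Grained Semantic Caption (MFSC), built automatically to provide fine-grained textual descriptions; 2. designs the FineGrainedAD framework, consisting of the Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA), to improve anomaly localization.
Method: 1. MFSC generates multi-level, fine-grained textual descriptions through an automatic pipeline; 2. MLLP injects fine-grained semantics into multi-level learnable prompts; 3. MLSA uses a region aggregation strategy and multi-level alignment training to improve semantic alignment.
Result: On the MVTec-AD and VisA datasets, FineGrainedAD achieves superior overall performance in few-shot settings.
Insight: Fine-grained textual descriptions and multi-level semantic alignment can substantially improve vision-language matching in few-shot anomaly detection and thus localization accuracy.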
Abstract: Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match each visual patch token when identifying potentially anomalous regions, which leads to semantic misalignment between image descriptions and patch-level visual anomalies and results in sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level and fine-grained textual descriptions for existing anomaly detection datasets through an automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training to help the learnable prompts align better with the corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.
[63] Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Pei Peng,MingKun Xie,Hang Hao,Tong Jin,ShengJun Huang
Main category: cs.CV
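TL;DR: This paper proposes a representation-level counterfactual calibration method that tackles object-context shortcuts in vision-language models, making zero-shot recognition more reliable and less biased.
Details
Motivation: Vision-language models such as CLIP are easily misled by object-context co-occurrence in zero-shot recognition, which degrades performance at test time. The paper recasts this as a causal inference problem and aims to remove the bias through counterfactual calibration. Contribution: 1) A representation-level counterfactual calibration method; 2) synthesis of counterfactual embeddings that simulate how an object would appear in different environments; 3) performance gains without retraining or prompt design.
Method: 1) Estimate object and background expectations in CLIP's representation space; 2) sample diverse contexts from external datasets, batch neighbors, or text-derived descriptions and synthesize counterfactual embeddings; 3) estimate the Total Direct Effect through simulated intervention and subtract background-only activation.
Result: On context-sensitive benchmarks, the method substantially improves both worst-group and average accuracy, setting a new zero-shot state of the art.
Insight: The paper demonstrates the effectiveness of lightweight counterfactual calibration and offers a practical causal framework for debiased and reliable multimodal reasoning.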
Abstract: Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP’s representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
[64] Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Xin Guo,Zhiheng Xi,Yiwen Ding,Yitao Zhai,Xiaowei Shi,Xunliang Cai,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.CV
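TL;DR: This paper addresses the "Matthew effect" in the self-improvement of large vision-language models (LVLMs), where the model keeps getting better on simple queries while complex ones fall further behind, by re-balancing head and tail data with four strategies.
Details
Motivation: The self-improvement paradigm is widely adopted for LVLMs, but the model performs well on simple queries (head data) and poorly on complex queries (tail data). The resulting imbalanced optimization hinders further improvement. Contribution: Proposes four efficient strategies, from the two perspectives of distribution-reshaping and trajectory-resampling, to re-balance head and tail data during self-improvement and counteract the Matthew effect.
Method: Four concrete strategies are derived from two perspectives, distribution-reshaping and trajectory-resampling, to balance optimization between simple and complex tasks.
Result: Experiments show that the strategies consistently improve visual reasoning, outperforming vanilla self-improvement by 3.86 points on average on Qwen2-VL-7B-Instruct and InternVL2.5-4B.
Insight: Balancing learning between simple and complex tasks during self-improvement is essential; otherwise the model drifts toward simple tasks and neglects complex reasoning skills.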
Abstract: Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced–a dynamic we term the “Matthew effect”–which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
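The abstract of entry [64] names trajectory-resampling as one of its two perspectives but does not spell out the strategies. Purely as an illustration of the general idea, and not as the authors' method, the sketch below caps the number of successful trajectories kept per query so that easy (head) queries cannot dominate the training set. The function name, the `cap` parameter, and the data layout are hypothetical.

```python
import random


def rebalance_by_cap(trajectories_by_query, cap, seed=0):
    """Keep at most `cap` successful trajectories per query.

    trajectories_by_query : dict mapping a query id to its list of
                            successful trajectories (hypothetical layout)
    cap                   : maximum trajectories retained per query
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    balanced = {}
    for query, trajs in trajectories_by_query.items():
        if len(trajs) > cap:
            # Head query: subsample without replacement.
            balanced[query] = rng.sample(trajs, cap)
        else:
            # Tail query: keep everything that succeeded.
            balanced[query] = list(trajs)
    return balanced


# Example: an easy query with 8 successes and a hard one with 1.
pool = {"easy": [f"t{i}" for i in range(8)], "hard": ["t_hard"]}
balanced_pool = rebalance_by_cap(pool, cap=2)
```

In this toy pool the easy query's share of the training data drops from 8 of 9 trajectories to 2 of 3, which is the kind of head-tail shift the paper's strategies aim for.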
[65] SA$^{2}$Net: Scale-Adaptive Structure-Affinity Transformation for Spine Segmentation from Ultrasound Volume Projection Imaging
Hao Xie,Zixun Huang,Yushen Zuo,Yakun Ju,Frank H. F. Leung,N. F. Law,Kin-Man Lam,Yong-Ping Zheng,Sai Ho Ling
Main category: cs.CV
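TL;DR: SA$^{2}$Net is a novel scale-adaptive structure-aware network for spine segmentation in ultrasound volume projection imaging. A scale-adaptive complementary strategy and a structure-affinity transformation, combined with a Transformer decoder, improve segmentation accuracy and robustness.
Details
Motivation: Spine segmentation is vital for intelligent scoliosis diagnosis, but existing methods struggle to capture the global context and structural knowledge of the spine, which limits segmentation performance. Contribution: 1. A scale-adaptive complementary strategy that learns cross-dimensional long-distance correlation features; 2. a structure-affinity transformation combined with a Transformer decoder for structure-aware reasoning; 3. a feature mixing loss aggregation method that enhances training.
Method: 1. The scale-adaptive complementary strategy captures multi-scale features; 2. the structure-affinity transformation, motivated by multi-head self-attention, encodes class-specific affinity; 3. a Transformer decoder and feature mixing loss aggregation refine the segmentation.
Result: SA$^{2}$Net outperforms existing methods in segmentation and adapts to various backbones, showing potential for scoliosis diagnosis.
Insight: By combining scale adaptivity with structure-affinity transformation, SA$^{2}$Net captures both multi-scale features and structural knowledge of the spine, offering a new direction for medical image segmentation.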
Abstract: Spine segmentation, based on ultrasound volume projection imaging (VPI), plays a vital role for intelligent scoliosis diagnosis in clinical applications. However, this task faces several significant challenges. Firstly, the global contextual knowledge of spines may not be well-learned if we neglect the high spatial correlation of different bone features. Secondly, the spine bones contain rich structural knowledge regarding their shapes and positions, which deserves to be encoded into the segmentation process. To address these challenges, we propose a novel scale-adaptive structure-aware network (SA$^{2}$Net) for effective spine segmentation. First, we propose a scale-adaptive complementary strategy to learn the cross-dimensional long-distance correlation features for spinal images. Second, motivated by the consistency between multi-head self-attention in Transformers and semantic level affinity, we propose structure-affinity transformation to transform semantic features with class-specific affinity and combine it with a Transformer decoder for structure-aware reasoning. In addition, we adopt a feature mixing loss aggregation method to enhance model training. This method improves the robustness and accuracy of the segmentation process. The experimental results demonstrate that our SA$^{2}$Net achieves superior segmentation performance compared to other state-of-the-art methods. Moreover, the adaptability of SA$^{2}$Net to various backbones enhances its potential as a promising tool for advanced scoliosis diagnosis using intelligent spinal image analysis. The code and experimental demo are available at https://github.com/taetiseo09/SA2Net.
[66] AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
Wen Xie,Yanjun Zhu,Gijs Overgoor,Yakov Bart,Agata Lapedriza Garcia,Sarah Ostadabbas
Main category: cs.CV
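TL;DR: The paper proposes AdSum, a two-stream audio-visual fusion framework for automated video advertisement clipping. It frames clipping as a shot selection problem and emphasizes the role of audio in ads, addressing the inefficiency of manual editing.
Details
Motivation: Advertisers need multiple versions of the same ad at different durations for a single campaign, and manual re-editing is labor-intensive and time-consuming. Existing video summarization methods focus mainly on visual content and overlook audio, which plays a critical role in advertising. Contribution: 1. The first formulation of video ad clipping as a shot selection problem; 2. a two-stream audio-visual fusion model that combines visual and audio information to predict frame importance; 3. the AdSum204 dataset, which fills the gap in ad-specific data.
Method: A two-stream audio-visual fusion model processes video and audio separately and predicts frame importance, defined as the likelihood that a frame is selected for the short ad. The model fuses visual and audio features.
Result: On AdSum204, the model outperforms existing methods across several metrics (Average Precision, Area Under Curve, Spearman, and Kendall).
Insight: Audio matters in ad clipping, and summarization methods that rely on visual information alone are ill-suited to advertising. Audio-visual models can improve clipping quality and offer a new direction for automated ad editing.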
Abstract: Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, Spearman, and Kendall.
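Entry [66] predicts per-frame importance and frames clipping as shot selection, but the abstract does not describe how shots are chosen from the scores. One simple possibility, offered only as a sketch under that assumption, is to score each shot by its mean frame importance and greedily keep the best shots that fit a frame budget. The shot boundaries, the greedy rule (a stand-in for an exact knapsack solver), and all names are hypothetical.

```python
def select_shots(shots, frame_scores, budget):
    """Greedily pick shots by mean frame importance under a frame budget.

    shots        : list of (start, end) frame indices, end exclusive
    frame_scores : predicted importance per frame (hypothetical model output)
    budget       : maximum total number of frames in the short ad
    Returns the indices of the chosen shots in chronological order.
    """
    def shot_score(index):
        start, end = shots[index]
        segment = frame_scores[start:end]
        return sum(segment) / len(segment) if segment else 0.0

    # Rank shots from most to least important.
    ranked = sorted(range(len(shots)), key=shot_score, reverse=True)
    chosen, used = [], 0
    for index in ranked:
        length = shots[index][1] - shots[index][0]
        if used + length <= budget:
            chosen.append(index)
            used += length
    # Restore original order so the clipped ad keeps its narrative sequence.
    return sorted(chosen)


# Example: three 2-frame shots, budget of 4 frames.
example = select_shots(
    [(0, 2), (2, 4), (4, 6)], [0.9, 0.8, 0.1, 0.2, 0.7, 0.6], budget=4
)
```

In the example the first and third shots are kept because their mean importance (0.85 and 0.65) beats the middle shot (0.15) and together they fill the budget.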
[67] Dynamic Context-Aware Scene Reasoning Using Vision-Language Alignment in Zero-Shot Real-World Scenarios
Manjunath Prasad Holenarasipura Rajiv,B. M. Vidyavathi
Main category: cs.CV
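TL;DR: The paper proposes a dynamic context-aware scene reasoning framework that uses vision-language alignment to handle zero-shot real-world scenarios, improving generalization to unseen environments.
Details
Motivation: In the real world, AI systems often face unfamiliar scenarios without labeled data. Conventional scene understanding models fail to generalize to new environments, which limits the deployment of vision applications in dynamic, unstructured settings.
Contribution: 1. A dynamic context-aware scene reasoning framework; 2. Integration of pre-trained vision transformers and large language models to align visual semantics with natural language descriptions; 3. A dynamic reasoning module that combines global scene cues with object-level interactions to improve zero-shot generalization.
Method: 1. Integrate pre-trained vision transformers and language models; 2. Design a dynamic reasoning module that combines global and local cues; 3. Use linguistic priors as guidance to improve understanding of complex environments.
Result: On zero-shot benchmarks such as COCO, Visual Genome, and Open Images, scene understanding accuracy improves by up to 18%, with robust performance in ambiguous or cluttered scenes.
Insight: The synergistic fusion of vision and language is key to zero-shot dynamic scene reasoning, and the framework's scalability and interpretability support real-world applications.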
Abstract: In real-world environments, AI systems often face unfamiliar scenarios without labeled data, creating a major challenge for conventional scene understanding models. The inability to generalize across unseen contexts limits the deployment of vision-based applications in dynamic, unstructured settings. This work introduces a Dynamic Context-Aware Scene Reasoning framework that leverages Vision-Language Alignment to address zero-shot real-world scenarios. The goal is to enable intelligent systems to infer and adapt to new environments without prior task-specific training. The proposed approach integrates pre-trained vision transformers and large language models to align visual semantics with natural language descriptions, enhancing contextual comprehension. A dynamic reasoning module refines predictions by combining global scene cues and object-level interactions guided by linguistic priors. Extensive experiments on zero-shot benchmarks such as COCO, Visual Genome, and Open Images demonstrate up to 18% improvement in scene understanding accuracy over baseline models in complex and unseen environments. Results also show robust performance in ambiguous or cluttered scenes due to the synergistic fusion of vision and language. This framework offers a scalable and interpretable approach for context-aware reasoning, advancing zero-shot generalization in dynamic real-world settings.
[68] CATCH: A Modular Cross-domain Adaptive Template with Hook
Xinjin Li,Yulie Lu,Jinghan Cao,Yu Ma,Zhenglin Li,Yeyang Zhou
Main category: cs.CV
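TL;DR: CATCH is a modular cross-domain adaptation framework that adds lightweight adapter modules on the vision and language sides, improving the generalization of Visual Question Answering (VQA) models across domains without retraining the backbone.
Details
Motivation: Existing VQA models perform poorly on cross-domain tasks, and conventional approaches rely on per-domain fine-tuning or bespoke pipelines that are costly and inflexible. CATCH aims to provide a lightweight, extensible alternative.
Contribution: 1. The CATCH framework, comprising a domain classifier and a dual adapter mechanism (Prompt Adapter and Visual Adapter); 2. Dynamic injection of the modules through a unified hook interface, with no modification of the backbone; 3. Performance gains on VQA tasks in several domains.
Method: CATCH contains two lightweight modules: a domain classifier that identifies the input image type, and a dual adapter mechanism (Prompt Adapter and Visual Adapter) that adjusts language and visual features respectively. Both are integrated into the backbone dynamically through the hook interface.
Result: CATCH gains +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA.
Insight: By decoupling visual and linguistic adaptation, CATCH offers a low-cost, flexible approach to cross-domain VQA and makes multi-domain deployment more practical.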
Abstract: Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their generalization degrades significantly when transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and not scalable across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier to identify the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks demonstrate that our framework achieves consistent performance gains without retraining the backbone model, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results highlight that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
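The hook-based injection described for CATCH can be pictured with a small dependency-free sketch. Everything here is hypothetical: the `Backbone` class, the two hook points, the toy adapters, and the tag-based domain classifier are stand-ins chosen for illustration and are not taken from the paper. The point is only that adapters are attached for one call and removed afterwards, so the backbone itself is never modified.

```python
class Backbone:
    """Stand-in for a frozen VQA model exposing two hook points (hypothetical)."""

    def __init__(self):
        self.hooks = {"prompt": [], "visual": []}

    def register_hook(self, point, fn):
        self.hooks[point].append(fn)
        # Return a handle that detaches this hook again.
        return lambda: self.hooks[point].remove(fn)

    def forward(self, prompt, visual_features):
        for fn in self.hooks["prompt"]:
            prompt = fn(prompt)
        for fn in self.hooks["visual"]:
            visual_features = fn(visual_features)
        return prompt, visual_features


# One (prompt adapter, visual adapter) pair per domain; toy transformations.
ADAPTERS = {
    "natural": (lambda p: p, lambda v: v),
    "medical": (lambda p: "[medical] " + p, lambda v: [x * 0.5 for x in v]),
    "chart": (lambda p: "[chart] " + p, lambda v: [x + 1.0 for x in v]),
}


def classify_domain(image_tag):
    """Toy domain classifier; a real one would inspect the image itself."""
    return image_tag if image_tag in ADAPTERS else "natural"


def answer(backbone, image_tag, prompt, visual_features):
    prompt_adapter, visual_adapter = ADAPTERS[classify_domain(image_tag)]
    remove_prompt = backbone.register_hook("prompt", prompt_adapter)
    remove_visual = backbone.register_hook("visual", visual_adapter)
    try:
        return backbone.forward(prompt, visual_features)
    finally:
        # Detach the adapters so the backbone is left untouched.
        remove_prompt()
        remove_visual()


model = Backbone()
out_prompt, out_features = answer(model, "medical", "What is shown?", [2.0, 4.0])
```

In a deep learning framework the same pattern would typically use the framework's own forward-hook mechanism rather than a hand-rolled registry.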
[69] Emu3.5: Native Multimodal Models are World Learners
Yufeng Cui,Honghao Chen,Haoge Deng,Xu Huang,Xinghang Li,Jirong Liu,Yang Liu,Zhuoyan Luo,Jinsheng Wang,Wenxuan Wang,Yueze Wang,Chengyuan Wang,Fan Zhang,Yingli Zhao,Ting Pan,Xianduo Li,Zecheng Hao,Wenxuan Ma,Zhuo Chen,Yulong Ao,Tiejun Huang,Zhongyuan Wang,Xinlong Wang
Main category: cs.CV
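TL;DR: Emu3.5 is a native multimodal world model trained with a unified next-token prediction objective. It accepts and produces interleaved vision-language sequences, and uses reinforcement learning and DiDA to improve reasoning ability and inference efficiency.
Details
Motivation: Existing models are limited in both capability and efficiency on multimodal tasks. Emu3.5 aims to close this gap with a unified prediction objective and an efficient inference method.
Contribution: 1. Emu3.5, which supports native multimodal input and output; 2. DiDA, which substantially improves inference efficiency; 3. Strong world-modeling and generalization abilities.
Method: 1. Pre-train with a unified next-token prediction objective; 2. Post-train with large-scale reinforcement learning; 3. Use DiDA to turn token-by-token decoding into bidirectional parallel prediction.
Result: Emu3.5 performs well across tasks, speeds up per-image inference by about 20x, matches Gemini 2.5 Flash Image on image generation and editing, and is better on interleaved generation tasks.
Insight: A unified prediction objective and efficient inference are key to native multimodal capability, and Emu3.5 shows the potential of world modeling.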
Abstract: We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.
[70] ResMatching: Noise-Resilient Computational Super-Resolution via Guided Conditional Flow Matching
Anirban Ray,Vera Galinova,Florian Jug
Main category: cs.CV
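TL;DR: ResMatching is a noise-resilient method for computational super-resolution (CSR) based on guided conditional flow matching. It performs strongly on the BioSR dataset and offers the best trade-off when inputs are noisy.
Details
Motivation: CSR in fluorescence microscopy is an ill-posed problem, and conventional methods struggle with noisy low-resolution images.
Contribution: ResMatching, which uses data-driven guided conditional flow matching to learn stronger data priors and provides pixel-wise data-uncertainty estimates.
Method: Learn the prior with guided conditional flow matching, and support sampling from the implicitly learned posterior distribution.
Result: On 4 biological structures from BioSR, ResMatching achieves the best trade-off between data fidelity and perceptual realism, and is particularly effective when the input is noisy.
Insight: ResMatching still learns a useful prior under heavy noise, and its uncertainty estimates help users identify unreliable predictions.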
Abstract: Computational Super-Resolution (CSR) in fluorescence microscopy has, despite being an ill-posed problem, a long history. At its very core, CSR is about finding a prior that can be used to extrapolate frequencies in a micrograph that have never been imaged by the image-generating microscope. It stands to reason that, with the advent of better data-driven machine learning techniques, stronger priors can be learned and hence CSR can lead to better results. Here, we present ResMatching, a novel CSR method that uses guided conditional flow matching to learn such improved data-priors. We evaluate ResMatching on 4 diverse biological structures from the BioSR dataset and compare its results against 7 baselines. ResMatching consistently achieves competitive results, demonstrating in all cases the best trade-off between data fidelity and perceptual realism. We observe that CSR using ResMatching is particularly effective in cases where a strong prior is hard to learn, e.g. when the given low-resolution images contain a lot of noise. Additionally, we show that ResMatching can be used to sample from an implicitly learned posterior distribution and that this distribution is calibrated for all tested use-cases, enabling our method to deliver a pixel-wise data-uncertainty term that can guide future users to reject uncertain predictions.
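For readers unfamiliar with flow matching, the following minimal sketch shows the generic ingredients under a common simplifying assumption: a straight-line path from a noise sample to a data sample, whose velocity is the regression target, followed by Euler integration at sampling time. The abstract does not state which path or guidance scheme ResMatching uses, so the linear path, the `condition` argument (standing in for the low-resolution input), and the oracle velocity function are all assumptions made for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for a velocity network along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, condition, steps=10):
    """Integrate dx/dt = v(x, t, condition) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, condition)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(4)]  # noise sample
x1 = [0.2, 0.4, 0.6, 0.8]                        # toy "high-resolution" target


def oracle(x, t, condition):
    """Stands in for a trained network: returns the exact target velocity."""
    return target_velocity(x0, x1)


sample = euler_sample(oracle, x0, condition=None)
```

With the oracle velocity the integration lands on `x1`; a trained network would only approximate this, and drawing several noise samples `x0` for the same condition is what yields posterior samples.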
[71] CYPRESS: Crop Yield Prediction via Regression on Prithvi’s Encoder for Satellite Sensing
Shayan Nejadshamsi,Yuanyuan Zhang,Shadi Zaki,Brock Porth,Lysa Porth,Vahab Khoshdel
Main category: cs.CV
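TL;DR: CYPRESS is a deep learning model that adapts a pre-trained large-scale geospatial foundation model (Prithvi-EO-2.0-600M) to a continuous regression task, turning multi-temporal satellite imagery into high-resolution, pixel-level yield maps that give precision agriculture an actionable tool.
Details
Motivation: Traditional crop yield prediction methods lack the scalability and granularity that precision farming requires, so a tool that produces high-resolution, continuous outputs is needed.
Contribution: The CYPRESS model, which fine-tunes a foundation model for continuous regression to deliver high-resolution crop yield prediction, bridging the gap between large-scale Earth observation and on-farm decision-making.
Method: Fine-tune the pre-trained Prithvi-EO-2.0-600M model on multi-temporal satellite imagery to generate pixel-level yield maps.
Result: On a dataset from the Canadian Prairies, CYPRESS outperforms existing deep learning models, demonstrating the effectiveness of foundation models in agricultural applications.
Insight: Fine-tuning large-scale foundation models can markedly improve performance on specialized tasks and opens a new technical path for precision agriculture.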
Abstract: Accurate and timely crop yield prediction is crucial for global food security and modern agricultural management. Traditional methods often lack the scalability and granularity required for precision farming. This paper introduces CYPRESS (Crop Yield Prediction via Regression on Prithvi’s Encoder for Satellite Sensing), a deep learning model designed for high-resolution, intra-field canola yield prediction. CYPRESS leverages a pre-trained, large-scale geospatial foundation model (Prithvi-EO-2.0-600M) and adapts it for a continuous regression task, transforming multi-temporal satellite imagery into dense, pixel-level yield maps. Evaluated on a comprehensive dataset from the Canadian Prairies, CYPRESS demonstrates superior performance over existing deep learning-based yield prediction models, highlighting the effectiveness of fine-tuning foundation models for specialized agricultural applications. By providing a continuous, high-resolution output, CYPRESS offers a more actionable tool for precision agriculture than conventional classification or county-level aggregation methods. This work validates a novel approach that bridges the gap between large-scale Earth observation and on-farm decision-making, offering a scalable solution for detailed agricultural monitoring.
[72] Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
Christoffer Koo Øhrstrøm,Ronja Güldenring,Lazaros Nalpantidis
Main category: cs.CV
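TL;DR: The paper proposes Spiking Patches, a tokenizer designed for event cameras that preserves the asynchrony and spatial sparsity of events while matching or exceeding the accuracy of frame and voxel representations at faster inference speeds.
Details
Motivation: Event cameras produce asynchronous, sparse data that calls for a representation preserving those properties. Existing frame and voxel representations are accurate but sacrifice asynchrony and sparsity.
Contribution: The Spiking Patches tokenizer, an efficient event representation that preserves asynchrony and spatial sparsity while improving on conventional representations in accuracy and inference speed.
Method: Tokenize the event stream into asynchronous, sparse tokens, and evaluate them with a GNN, a PCN, and a Transformer.
Result: Spiking Patches yields inference up to 3.4x faster than voxels and up to 10.4x faster than frames, with equal or higher accuracy on gesture recognition and object detection (absolute gains of up to 3.8 and 1.4, respectively).
Insight: Tokenization opens a new research direction for event cameras, showing that efficiency and accuracy can improve while the properties of event data are preserved.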
Abstract: We propose tokenization of events and present a tokenizer, Spiking Patches, specifically designed for event cameras. Given a stream of asynchronous and spatially sparse events, our goal is to discover an event representation that preserves these properties. Prior works have represented events as frames or as voxels. However, while these representations yield high accuracy, both frames and voxels are synchronous and decrease the spatial sparsity. Spiking Patches gives the means to preserve the unique properties of event cameras and we show in our experiments that this comes without sacrificing accuracy. We evaluate our tokenizer using a GNN, PCN, and a Transformer on gesture recognition and object detection. Tokens from Spiking Patches yield inference times that are up to 3.4x faster than voxel-based tokens and up to 10.4x faster than frames. We achieve this while matching their accuracy and even surpassing in some cases with absolute improvements up to 3.8 for gesture recognition and up to 1.4 for object detection. Thus, tokenization constitutes a novel direction in event-based vision and marks a step towards methods that preserve the properties of event cameras.
[73] PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus
Bingcong Huo,Zhiming Wang
Main category: cs.CV
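TL;DR: PT-DETR is an improved algorithm for small object detection in UAV imagery. It adds the PADF and MFFF modules to strengthen small-object feature extraction and uses Focaler-SIoU to improve bounding box matching, raising detection accuracy and robustness.
Details
Motivation: Object detection in UAV imagery faces complex backgrounds, occlusion, dense small objects, and varying lighting. Existing methods perform poorly in these conditions, so a more effective detector is needed.
Contribution: 1. The Partially-Aware Detail Focus (PADF) module, which enhances feature extraction for small objects; 2. The Median-Frequency Feature Fusion (MFFF) module, which improves the capture of detail and contextual information; 3. Focaler-SIoU for better bounding box matching and higher sensitivity to small objects.
Method: Built on the RT-DETR framework: the PADF module enhances feature extraction, the MFFF module fuses multi-scale features, and Focaler-SIoU improves bounding box matching.
Result: On VisDrone2019, PT-DETR improves mAP over RT-DETR by 1.6% and 1.7%, with lower computational complexity and fewer parameters.
Insight: Strengthening the capture of local detail and context for small objects can markedly improve detection in UAV imagery while keeping computational cost low.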
Abstract: To address the challenges in UAV object detection, such as complex backgrounds, severe occlusion, dense small objects, and varying lighting conditions, this paper proposes PT-DETR based on RT-DETR, a novel detection algorithm specifically designed for small objects in UAV imagery. In the backbone network, we introduce the Partially-Aware Detail Focus (PADF) Module to enhance feature extraction for small objects. Additionally, we design the Median-Frequency Feature Fusion (MFFF) module, which effectively improves the model’s ability to capture small-object details and contextual information. Furthermore, we incorporate Focaler-SIoU to strengthen the model’s bounding box matching capability and increase its sensitivity to small-object features, thereby further enhancing detection accuracy and robustness. Compared with RT-DETR, our PT-DETR achieves mAP improvements of 1.6% and 1.7% on the VisDrone2019 dataset with lower computational complexity and fewer parameters, demonstrating its robustness and feasibility for small-object detection tasks.
[74] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Sayed Pedram Haeri Boroujeni,Niloufar Mehrabi,Hazim Alzorgan,Ahmad Sarlak,Mahlagha Fazeli,Abolfazl Razi
Main category: cs.CV
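TL;DR: This survey focuses on object detection for autonomous vehicles, analyzing multimodal sensor fusion and the use of emerging Vision-Language Models (VLMs) and Large Language Models (LLMs), and outlines future research directions.
Details
Motivation: The success of autonomous vehicles depends on reliable object detection in complex environments, yet knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. The survey aims to close this gap with a forward-looking analysis.
Contribution: (1) A systematic review of the capabilities and limitations of AV sensors; (2) A structured categorization of datasets; (3) An analysis of recent techniques, from 2D/3D detection to Transformer-based hybrid sensor fusion.
Method: A literature survey centered on multimodal sensors (such as camera, LiDAR, and radar), emerging VLM/LLM frameworks, and Transformer-driven perception.
Result: The synthesis lays out the current capabilities, open challenges, and future research directions, particularly the prospects for VLM/LLM-driven perception frameworks.
Insight: Combining multimodal sensor fusion with generative AI (such as VLMs and LLMs) is a key direction for AV object detection, and the cooperative, dynamic nature of datasets is an important challenge still to be addressed.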
Abstract: Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
[75] Towards Reliable Sea Ice Drift Estimation in the Arctic: Deep Learning Optical Flow on RADARSAT-2
Daniela Martin,Joseph Gallego
Main category: cs.CV
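TL;DR: The paper presents the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery, showing that they can estimate Arctic sea ice drift accurately and opening new opportunities for navigation and climate modeling.
Details
Motivation: Classical drift estimation relies on mathematical models and strong assumptions, which limits accuracy. Deep learning optical flow performs well in computer vision, but its application to geophysical problems and SAR imagery remains underexplored.
Contribution: 1) The first large-scale evaluation of deep learning optical flow models for sea ice drift estimation; 2) Evidence that these models reach sub-kilometer accuracy on SAR imagery (EPE of 300 to 400 m); 3) A demonstration that the models capture regional drift patterns.
Method: Benchmark 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery, using endpoint error (EPE) and Fl-all as metrics and GNSS-tracked buoys as reference.
Result: Several models reach sub-kilometer accuracy (EPE of 6 to 8 pixels, 300 to 400 m) and produce spatially continuous motion fields; the authors credit deep learning-based optical flow with substantially better accuracy than classical methods.
Insight: Deep learning optical flow transfers successfully to polar remote sensing, providing accurate and spatially continuous drift estimates that benefit navigation and climate modeling.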
Abstract: Accurate estimation of sea ice drift is critical for Arctic navigation, climate research, and operational forecasting. While optical flow, a computer vision technique for estimating pixel-wise motion between consecutive images, has advanced rapidly in computer vision, its applicability to geophysical problems and to satellite SAR imagery remains underexplored. Classical optical flow methods rely on mathematical models and strong assumptions about motion, which limit their accuracy in complex scenarios. Recent deep learning-based approaches have substantially improved performance and are now the standard in computer vision, motivating their application to sea ice drift estimation. We present the first large-scale benchmark of 48 deep learning optical flow models on RADARSAT-2 ScanSAR sea ice imagery, evaluated with endpoint error (EPE) and Fl-all metrics against GNSS-tracked buoys. Several models achieve sub-kilometer accuracy (EPE 6 to 8 pixels, 300 to 400 m), a small error relative to the spatial scales of sea ice motion and typical navigation requirements in the Arctic. Our results demonstrate that the models are capable of capturing consistent regional drift patterns and that recent deep learning-based optical flow methods, which have substantially improved motion estimation accuracy compared to classical methods, can be effectively transferred to polar remote sensing. Optical flow produces spatially continuous drift fields, providing motion estimates for every image pixel rather than at sparse buoy locations, offering new opportunities for navigation and climate modeling.
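The endpoint error quoted above is the mean Euclidean distance between predicted and reference displacement vectors. A short sketch of that computation follows, with a pixel-to-meter conversion. The 50 m per pixel spacing is an assumption inferred from the reported correspondence of 6 to 8 pixels with 300 to 400 m; it is not stated in the abstract. The sample vectors are invented.

```python
import math

# Assumed ground spacing, inferred from "6 to 8 pixels, 300 to 400 m".
METERS_PER_PIXEL = 50.0


def endpoint_error(pred, truth):
    """Mean Euclidean distance between predicted and reference flow vectors."""
    dists = [
        math.hypot(pu - tu, pv - tv)
        for (pu, pv), (tu, tv) in zip(pred, truth)
    ]
    return sum(dists) / len(dists)


# Toy displacements in pixels: model output vs. buoy-derived reference.
pred = [(3.0, 4.0), (0.0, 0.0)]
truth = [(0.0, 0.0), (6.0, 8.0)]

epe_px = endpoint_error(pred, truth)
epe_m = epe_px * METERS_PER_PIXEL
```

Because buoys are sparse, an evaluation of this kind samples the dense flow field only at buoy locations, which is why the reference list is short compared with the number of image pixels.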
[76] Improving Classification of Occluded Objects through Scene Context
Courtney M. King,Daniel D. Leeds,Damian Lyons,George Kalaitzis
Main category: cs.CV
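TL;DR: The paper proposes two scene-based information fusion methods to improve RPN-DCNN networks on occluded object classification. The first selects a network according to the background scene; the second fuses scene knowledge after detection. Training on a mix of occluded and unoccluded images works best.
Details
Motivation: Recognizing occluded objects is a hard problem in computer vision, and existing algorithms perform poorly on it. Scene context aids object recognition in biological vision, so the authors fuse scene information to make models more robust.
Contribution: 1. Two methods for fusing scene information; 2. Experiments showing that training on both occluded and unoccluded data is more effective; 3. A method that is interpretable and easy to adapt.
Method: 1. Before prediction, select a custom object network based on the identified background scene; 2. After detection, fuse scene knowledge into the initial object scores output by the RPN.
Result: On datasets with partial occlusions, the methods improve both recall and precision over baseline methods.
Insight: Training on a combination of occluded and unoccluded data outperforms the alternatives, and deeper use of scene information is a direction for future work.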
Abstract: The presence of occlusions has provided substantial challenges to typically-powerful object recognition algorithms. Additional sources of information can be extremely valuable to reduce errors caused by occlusions. Scene context is known to aid in object recognition in biological vision. In this work, we attempt to add robustness into existing Region Proposal Network-Deep Convolutional Neural Network (RPN-DCNN) object detection networks through two distinct scene-based information fusion techniques. We present one algorithm under each methodology: the first operates prior to prediction, selecting a custom object network to use based on the identified background scene, and the second operates after detection, fusing scene knowledge into initial object scores output by the RPN. We demonstrate our algorithms on challenging datasets featuring partial occlusions, which show overall improvement in both recall and precision against baseline methods. In addition, our experiments contrast multiple training methodologies for occlusion handling, finding that training on a combination of both occluded and unoccluded images demonstrates an improvement over the others. Our method is interpretable and can easily be adapted to other datasets, offering many future directions for research and practical applications.
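The second technique, fusing scene knowledge into object scores after detection, can be illustrated with a simple reweighting. The abstract does not give the fusion rule, so the product-and-renormalize scheme below is an assumption, and the labels, scores, and scene prior are invented for the example.

```python
def fuse_scene_context(object_scores, scene_prior):
    """Reweight detector scores by an assumed P(object | scene), then renormalize."""
    fused = {
        label: score * scene_prior.get(label, 0.0)
        for label, score in object_scores.items()
    }
    total = sum(fused.values())
    if total == 0.0:
        # The prior rules out every candidate: fall back to the raw scores.
        return dict(object_scores)
    return {label: value / total for label, value in fused.items()}


# An occluded object that the detector cannot separate well on its own.
scores = {"mug": 0.40, "ball": 0.45, "lamp": 0.15}
# Hypothetical co-occurrence prior for a kitchen scene.
kitchen_prior = {"mug": 0.6, "ball": 0.1, "lamp": 0.3}

fused = fuse_scene_context(scores, kitchen_prior)
best = max(fused, key=fused.get)
```

Here the raw detector slightly prefers "ball", while the scene prior shifts the decision to "mug". Keeping the prior as an explicit table is also what makes this kind of fusion easy to inspect.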
[77] Process Integrated Computer Vision for Real-Time Failure Prediction in Steel Rolling Mill
Vaibhav Kurrey,Sivakalyan Pujari,Gagan Raj Gupta
Main category: cs.CV
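TL;DR: The paper describes a machine vision-based anomaly detection system for real-time failure prediction in a steel rolling mill, combining industrial cameras with deep learning models to predict equipment failures early and support maintenance.
Details
Motivation: Equipment failures and process interruptions in steel rolling mills cause costly unplanned downtime, so a real-time, reliable failure prediction method is needed to reduce losses.
Contribution: A real-time failure prediction system that integrates industrial cameras and deep learning models, deploys efficiently across production lines, and combines visual inputs with sensor data for failure localization and root-cause analysis.
Method: Industrial cameras capture live video of the process line, deep learning models on a centralized video server perform anomaly detection, and sensor data is analyzed jointly with the visual inputs.
Result: The system predicts equipment failures early and provides actionable maintenance insights, improving the reliability and profitability of the production line.
Insight: Integrating visual and sensor data has considerable potential in industrial settings, enabling proactive maintenance and cost reduction.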
Abstract: We present a long-term deployment study of a machine vision-based anomaly detection system for failure prediction in a steel rolling mill. The system integrates industrial cameras to monitor equipment operation, alignment, and hot bar motion in real time along the process line. Live video streams are processed on a centralized video server using deep learning models, enabling early prediction of equipment failures and process interruptions, thereby reducing unplanned breakdown costs. Server-based inference minimizes the computational load on industrial process control systems (PLCs), supporting scalable deployment across production lines with minimal additional resources. By jointly analyzing sensor data from data acquisition systems and visual inputs, the system identifies the location and probable root causes of failures, providing actionable insights for proactive maintenance. This integrated approach enhances operational reliability, productivity, and profitability in industrial manufacturing environments.
[78] The Impact and Outlook of 3D Gaussian Splatting
Bernhard Kerbl
Main category: cs.CV
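TL;DR: This summary reviews key developments following 3D Gaussian Splatting (3DGS) in 3D scene representation, including efficiency gains, dynamic representations, exploration of the mathematical foundations, and applications on mobile and virtual reality platforms.
Details
Motivation: 3DGS is a breakthrough 3D representation that has inspired a large body of follow-up research, with room for improvement in efficiency, dynamic content, and application scenarios.
Contribution: An overview of several key research directions around 3DGS, showing how it has grown from a base representation into a versatile tool for 3D vision and graphics.
Method: Summarizes improvements to 3DGS through efficient training and rendering, dynamic (4DGS) representations, and exploration of the mathematical foundations.
Result: 3DGS has advanced considerably in resource efficiency, dynamic scene handling, and massive-scale environments, with recent progress toward near-instant radiance field reconstruction.
Insight: 3DGS shows potential as a core tool for 3D technology and may further drive developments in mobile and virtual reality.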
Abstract: Since its introduction, 3D Gaussian Splatting (3DGS) has rapidly transformed the landscape of 3D scene representations, inspiring an extensive body of associated research. Follow-up work includes analyses and contributions that enhance the efficiency, scalability, and real-world applicability of 3DGS. In this summary, we present an overview of several key directions that have emerged in the wake of 3DGS. We highlight advances enabling resource-efficient training and rendering, the evolution toward dynamic (or four-dimensional, 4DGS) representations, and deeper exploration of the mathematical foundations underlying its appearance modeling and rendering process. Furthermore, we examine efforts to bring 3DGS to mobile and virtual reality platforms, its extension to massive-scale environments, and recent progress toward near-instant radiance field reconstruction via feed-forward or distributed computation. Collectively, these developments illustrate how 3DGS has evolved from a breakthrough representation into a versatile and foundational tool for 3D vision and graphics.
[79] SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
Anushka Sivakumar,Andrew Zhang,Zaber Hakim,Chris Thomas
Main category: cs.CV
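TL;DR: SteerVLM is a lightweight activation steering module that dynamically adjusts the activations connecting the language modality with image context, giving fine-grained control over the outputs of Vision-Language Models (VLMs) while preserving off-target task performance.
Details
Motivation: Existing VLMs offer limited flexibility in output control: fine-grained control over complex semantics is hard to achieve without modifying model weights, and existing methods often need many parameters or manual tuning.
Contribution: 1. The lightweight SteerVLM module, which learns parameters equal to only 0.14% of the original VLM's size; 2. The VNIA dataset for developing and evaluating VLM steering techniques; 3. Better results than existing methods on VLM steering and hallucination mitigation.
Method: 1. Learn dynamic activation adjustments from the latent embeddings of paired prompts; 2. Apply dimension-wise activation modulation and adaptive steering across layers, with no static vectors or manual tuning; 3. Use the VNIA dataset for development and evaluation.
Result: SteerVLM outperforms existing methods on VLM steering and hallucination mitigation benchmarks while preserving off-target task performance.
Insight: Activation engineering can serve as a lightweight, efficient way to control multimodal models at a fine-grained level without modifying the original weights.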
Abstract: This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM’s size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
[80] Surpassing state of the art on AMD area estimation from RGB fundus images through careful selection of U-Net architectures and loss functions for class imbalance
Valentyna Starodub,Mantas Lukoševičius
Main category: cs.CV
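TL;DR: The paper proposes a semantic segmentation approach based on U-Net architectures and specialized loss functions for detecting age-related macular degeneration (AMD) lesions in RGB fundus images, surpassing the previous best results on the ADAM challenge benchmark.
Details
Motivation: AMD is one of the leading causes of irreversible vision impairment in people over 60, and RGB fundus imaging is non-invasive and cost-effective. The study aims to improve AMD lesion detection accuracy through better semantic segmentation.
Contribution: 1) A systematic evaluation and comparison of several U-Net architectures and loss functions; 2) An improved training strategy for class imbalance; 3) The best performance on multi-class AMD lesion segmentation on the ADAM challenge benchmark.
Method: Based on the U-Net architecture, with three sources of improvement: 1) encoder (backbone) networks of varying complexity; 2) specialized loss functions for class imbalance at the image and pixel levels; 3) tuned pre-processing techniques.
Result: The final framework outperforms all prior ADAM challenge submissions on multi-class AMD lesion segmentation.
Insight: Carefully chosen loss functions for class imbalance, together with a suitable network architecture, can markedly improve semantic segmentation, particularly in medical image analysis.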
Abstract: Age-related macular degeneration (AMD) is one of the leading causes of irreversible vision impairment in people over the age of 60. This research focuses on semantic segmentation for AMD lesion detection in RGB fundus images, a non-invasive and cost-effective imaging technique. The results of the ADAM challenge - the most comprehensive AMD detection from RGB fundus images research competition and open dataset to date - serve as a benchmark for our evaluation. Taking the U-Net connectivity as a base of our framework, we evaluate and compare several approaches to improve the segmentation model’s architecture and training pipeline, including pre-processing techniques, encoder (backbone) deep network types of varying complexity, and specialized loss functions to mitigate class imbalances on image and pixel levels. The main outcome of this research is the final configuration of the AMD detection framework, which outperforms all the prior ADAM challenge submissions on the multi-class segmentation of different AMD lesion types in non-invasive RGB fundus images. The source code used to conduct the experiments presented in this paper is made freely available.
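To see why pixel-level class imbalance calls for a specialized loss, consider a mask where lesions cover only 2% of the pixels. The abstract does not name the losses the authors selected, so the soft Dice loss below is shown only as one widely used option, and the toy masks are invented. A prediction that labels everything as background reaches 98% pixel accuracy, yet its Dice loss stays close to 1.

```python
def soft_dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss over flattened foreground probabilities and binary labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def pixel_accuracy(probs, targets, threshold=0.5):
    """Fraction of pixels whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(t) for p, t in zip(probs, targets))
    return hits / len(targets)


# 100 pixels, only 2 of which belong to the lesion class.
targets = [0] * 98 + [1, 1]
all_background = [0.0] * 100
perfect = [float(t) for t in targets]

acc_trivial = pixel_accuracy(all_background, targets)
dice_trivial = soft_dice_loss(all_background, targets)
dice_perfect = soft_dice_loss(perfect, targets)
```

Overlap-based losses of this kind ignore the large number of correctly predicted background pixels, which is what makes them less sensitive to imbalance than plain accuracy or unweighted cross-entropy.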
[81] ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Aniruddh Bansal,Davit Soselia,Dang Nguyen,Tianyi Zhou
Main category: cs.CV
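TL;DR: The paper introduces ChartAB, a benchmark for comprehensively evaluating vision-language models (VLMs) on chart grounding and dense alignment, covering tabular data extraction, localization of visualization elements, and recognition of chart attributes. With a JSON template and a two-stage inference workflow, it reveals biases and limitations in VLMs' fine-grained chart understanding.
Details
Motivation: Existing VLMs perceive chart details poorly and struggle to extract fine-grained structures, which limits their ability to compare and reason over multiple charts. ChartAB is proposed to close this gap and drive progress.
Contribution: 1. The ChartAB benchmark, focused on evaluating chart grounding and dense alignment; 2. A JSON template for computing task-specific metrics; 3. A two-stage inference workflow for evaluating alignment of elements and attributes across charts; 4. Findings on VLMs' perception biases, weaknesses, robustness, and hallucinations in chart understanding.
Method: 1. Build a dataset of charts of diverse types and complexities; 2. Design a JSON template to standardize the calculation of evaluation metrics; 3. Use a two-stage inference workflow (single-chart understanding first, cross-chart alignment second); 4. Run experiments on several VLMs.
Result: The experiments expose fine-grained differences among VLMs on chart understanding tasks and identify their limitations in perception bias, hallucination, and robustness, giving concrete directions for improvement.
Insight: 1. VLM performance on chart understanding is limited by detail perception; 2. A two-stage inference workflow effectively supports cross-chart alignment tasks; 3. Current models need stronger training on specific skills such as fine-grained alignment.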
Abstract: Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel “ChartAlign Benchmark (ChartAB)” to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs’ capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
[82] HEIR: Learning Graph-Based Motion Hierarchies
Cheng Zheng,William Koch,Baiang Li,Felix Heide
Main category: cs.CV
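TL;DR: The paper proposes HEIR, a graph-learning-based hierarchical motion modeling method that learns interpretable motion hierarchies directly from data and applies to a range of motion modeling tasks.
Details
Motivation: Existing hierarchical motion models typically rely on manually defined or heuristic hierarchies with fixed motion primitives, which limits generalization. The paper aims to learn more adaptable and interpretable motion hierarchies from data.
Contribution: 1. A general graph-based hierarchical motion modeling method; 2. A formulation of hierarchy inference as a differentiable graph learning problem; 3. Validation on 1D, 2D, and 3D dynamic scenes.
Method: Represent motion as a graph-based hierarchy in which vertices are elemental motions and directed edges capture parent-child dependencies learned through graph neural networks, decomposing global absolute motion into parent-inherited patterns and local residuals.
Result: The method reconstructs the intrinsic motion hierarchy in the 1D and 2D cases and produces more realistic and interpretable deformations than the baseline on dynamic 3D Gaussian splatting scenes.
Insight: Data-driven hierarchy learning offers a more flexible motion modeling paradigm that applies to a broad range of motion-centric tasks.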
Abstract: Hierarchical structures of motion exist across research fields, including computer vision, graphics, and robotics, where complex dynamics typically arise from coordinated interactions among simpler motion components. Existing methods to model such dynamics typically rely on manually-defined or heuristic hierarchies with fixed motion primitives, limiting their generalizability across different tasks. In this work, we propose a general hierarchical motion modeling method that learns structured, interpretable motion relationships directly from data. Our method represents observed motions using graph-based hierarchies, explicitly decomposing global absolute motions into parent-inherited patterns and local motion residuals. We formulate hierarchy inference as a differentiable graph learning problem, where vertices represent elemental motions and directed edges capture learned parent-child dependencies through graph neural networks. We evaluate our hierarchical reconstruction approach on three examples: 1D translational motion, 2D rotational motion, and dynamic 3D scene deformation via Gaussian splatting. Experimental results show that our method reconstructs the intrinsic motion hierarchy in 1D and 2D cases, and produces more realistic and interpretable deformations compared to the baseline on dynamic 3D Gaussian splatting scenes. By providing an adaptable, data-driven hierarchical modeling paradigm, our method offers a formulation applicable to a broad range of motion-centric tasks. Project Page: https://light.princeton.edu/HEIR/
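The decomposition of absolute motion into a parent-inherited part plus a local residual can be made concrete with a fixed, hand-written hierarchy. In HEIR the parent-child edges are learned; here the `parents` list is simply given, and the 2D vectors are invented, so this sketch illustrates only the arithmetic of the decomposition and assumes the hierarchy is acyclic.

```python
def compose_absolute(parents, residuals):
    """Absolute motion of each node = parent's absolute motion + local residual.

    parents[i] is the parent index of node i, or None for a root.
    The hierarchy is assumed to be acyclic (a forest).
    """
    absolute = [None] * len(parents)

    def resolve(i):
        if absolute[i] is None:
            base = (0.0, 0.0) if parents[i] is None else resolve(parents[i])
            absolute[i] = (base[0] + residuals[i][0], base[1] + residuals[i][1])
        return absolute[i]

    return [resolve(i) for i in range(len(parents))]


def local_residuals(parents, absolute):
    """Inverse step: subtract each parent's absolute motion from the child's."""
    out = []
    for i, p in enumerate(parents):
        base = (0.0, 0.0) if p is None else absolute[p]
        out.append((absolute[i][0] - base[0], absolute[i][1] - base[1]))
    return out


# Chain 0 -> 1 -> 2, e.g. torso -> upper arm -> forearm (hypothetical labels).
parents = [None, 0, 1]
residuals = [(1.0, 0.0), (0.0, 0.5), (0.25, 0.0)]

absolute = compose_absolute(parents, residuals)
recovered = local_residuals(parents, absolute)
```

Under a good hierarchy the residuals are small and simple even when the absolute motions are not, which is one way to read the interpretability claim.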
[83] The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
Jing Lin,Ruisi Wang,Junzhe Lu,Ziqi Huang,Guorui Song,Ailing Zeng,Xian Liu,Chen Wei,Wanqi Yin,Qingping Sun,Zhongang Cai,Lei Yang,Ziwei Liu
Main category: cs.CV
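TL;DR: This paper presents ViMoGen, a comprehensive framework that transfers knowledge from video generation (ViGen) to 3D human motion generation (MoGen) to address the latter's limited generalization. It builds the large-scale dataset ViMoGen-228K, proposes the flow-matching-based diffusion transformer ViMoGen and its lightweight variant ViMoGen-light, and designs the hierarchical benchmark MBench, improving both the quality and the generalization of motion generation.
Details
Motivation: Existing 3D human motion generation models perform well on standard benchmarks, but their generalization remains limited. Video generation, by contrast, has shown stronger generalization in modeling human behavior. The paper therefore transfers knowledge from video generation to improve the generalization of motion generation models.
Contribution: 1. ViMoGen-228K, a dataset of 228,000 high-quality motion samples that combines optical MoCap data, semantically annotated motions from web videos, and samples synthesized by ViGen models. 2. ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models. 3. ViMoGen-light, a distilled lightweight variant that preserves strong generalization. 4. MBench, a hierarchical benchmark for fine-grained evaluation of motion quality, prompt fidelity, and generalization ability.
Method: 1. Dataset construction: combines MoCap data, semantically annotated motions from web videos, and ViGen-synthesized samples. 2. Model design: ViMoGen is a flow-matching-based diffusion transformer that unifies priors through gated multimodal conditioning; ViMoGen-light removes the video generation dependency through distillation. 3. Evaluation: MBench assesses generated results at multiple levels.
Result: Experiments show that the ViMoGen framework significantly outperforms existing approaches in both automatic and human evaluations, with stronger generalization and motion quality.
Insight: Transferring knowledge from video generation can improve the generalization of 3D motion generation models, and multimodal data with large-scale annotation is key to this. A distilled lightweight model can also lower computational cost while preserving strong generalization.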
Abstract: Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage. Motivated by this observation, we present a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples generated by state-of-the-art ViGen models. The dataset includes both text-motion pairs and text-video-motion triplets, substantially expanding semantic diversity. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. To enhance efficiency, we further develop ViMoGen-light, a distilled variant that eliminates video generation dependencies while preserving strong generalization. Finally, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability. Extensive experiments show that our framework significantly outperforms existing approaches in both automatic and human evaluations. The code, data, and benchmark will be made publicly available.
[84] SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
Dongyue Lu,Ao Liang,Tianxin Huang,Xiao Fu,Yuyang Zhao,Baorui Ma,Liang Pan,Wei Yin,Lingdong Kong,Wei Tsang Ooi,Ziwei Liu
Main category: cs.CV
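TL;DR: SEE4D proposes a pose-free method for 4D content generation. By using auto-regressive video inpainting, it avoids the dependence on camera poses found in prior methods and improves performance in complex scenes.
Details
Motivation: Existing video-to-4D methods typically require manually annotated camera poses, which are costly and brittle for in-the-wild footage. SEE4D aims to enable 4D content generation without pose annotations.
Contribution: 1. A trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to fixed virtual cameras; 2. A view-conditional video inpainting model that learns a geometry prior; 3. A spatiotemporal autoregressive inference pipeline that supports coherent multi-view generation.
Method: Combines virtual cameras with video inpainting: 1. warp input frames to a bank of fixed virtual cameras; 2. train a video inpainting model to fill missing regions; 3. extend the video through autoregressive inference.
Result: On cross-view video generation and sparse reconstruction benchmarks, SEE4D outperforms pose- or trajectory-conditioned baselines, supporting its generalization and performance advantages.
Insight: By separating camera control from scene modeling, SEE4D simplifies 4D generation, and learning a geometry prior through video inpainting removes the need for explicit 3D annotations.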
Abstract: Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce SEE4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.
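The idea of extending a video with overlapping windows, so that each autoregressive step handles a bounded number of frames, can be sketched as a simple index schedule. This is an illustrative assumption about how such windows might be laid out; the function name `overlapping_windows` and its parameters are hypothetical and do not come from the paper.

```python
def overlapping_windows(num_frames, window, overlap):
    """Return (start, end) frame ranges, end exclusive, that cover the video.

    Consecutive windows share `overlap` frames, so each step can condition on
    frames produced by the previous step while processing at most `window`
    frames at a time.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    stride = window - overlap
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start += stride
    return windows


schedule = overlapping_windows(num_frames=10, window=4, overlap=2)
```

Because the window length is fixed, the per-step cost stays bounded regardless of the total video length, which matches the "bounded per-step complexity" described in the abstract.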
[85] Masked Diffusion Captioning for Visual Feature Learning
Chao Feng,Zihao Wei,Andrew Owens
Main category: cs.CV
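TL;DR: This paper proposes masked diffusion captioning (MDC), which learns visual features with an image-conditioned masked diffusion language model; the learned features are competitive on downstream vision tasks.
Details
Motivation: Existing visual feature learning methods often rely on autoregressive captioning or contrastive learning, but these may be limited by token position in the sequence or require auxiliary objectives. MDC aims to provide a learning signal that does not depend on sequence position.
Contribution: MDC, a new visual feature learning framework that learns visual features from image-caption pairs through masked diffusion captioning and avoids the limitations of autoregressive methods.
Method: An image-conditioned masked diffusion language model masks text tokens in each caption at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text.
Result: Linear probing experiments across a variety of academic-scale models and datasets show that the visual features learned by MDC are competitive with those from autoregressive and contrastive approaches.
Insight: Masked diffusion provides a visual learning signal whose strength does not depend on a token's position in the sequence, which reduces the need for auxiliary objectives.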
Abstract: We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token’s position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
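The masking step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the ratio is drawn uniformly, that at least one token is always masked, and that a literal `[MASK]` string stands in for the mask token; `mask_caption` is a hypothetical name.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumption)


def mask_caption(tokens, rng):
    """Mask a caption at a randomly chosen ratio.

    Returns the masked token list and a dict mapping each masked position to
    its original token, i.e. the reconstruction targets for the decoder.
    """
    if not tokens:
        raise ValueError("caption must contain at least one token")
    ratio = rng.random()  # uniform in [0, 1); the true schedule is an assumption
    num_to_mask = max(1, round(ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return masked, targets


caption = "a dog runs across the beach".split()
masked, targets = mask_caption(caption, random.Random(0))
```

Since masked positions are sampled uniformly over the caption, every token position is equally likely to become a prediction target, unlike left-to-right captioning where early tokens see little textual context and late tokens see a lot.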
[86] OmniX: From Unified Panoramic Generation and Perception to Graphics-Ready 3D Scenes
Yukun Huang,Jiwen Yu,Yanning Zhou,Jianan Wang,Xintao Wang,Pengfei Wan,Xihui Liu
Main category: cs.CV
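TL;DR: OmniX is a versatile, unified framework that reuses 2D generative priors for panoramic perception, generation, and completion, and can produce 3D scenes suitable for physically based rendering (PBR).
Details
Motivation: Existing 2D lifting approaches emphasize appearance generation and ignore the perception of intrinsic properties. OmniX aims to fill this gap and generate 3D scenes suitable for PBR.
Contribution: 1. The OmniX framework, which performs panoramic perception and generation through a lightweight cross-modal adapter structure; 2. A large-scale synthetic panorama dataset.
Method: Uses the priors of 2D generative models, through a cross-modal adapter, for panoramic perception and generation, handling geometry, textures, and PBR materials in a unified way.
Result: Experiments show that OmniX is effective in panoramic visual perception and 3D scene generation, supporting the generation of physically realistic virtual worlds.
Insight: 2D generative models can be repurposed for panoramic perception tasks, opening new possibilities for 3D scene generation.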
Abstract: There are two prevalent ways of constructing 3D scenes: procedural generation and 2D lifting. Among them, panorama-based 2D lifting has emerged as a promising technique, leveraging powerful 2D generative priors to produce immersive, realistic, and diverse 3D environments. In this work, we advance this technique to generate graphics-ready 3D scenes suitable for physically based rendering (PBR), relighting, and simulation. Our key insight is to repurpose 2D generative models for panoramic perception of geometry, textures, and PBR materials. Unlike existing 2D lifting approaches that emphasize appearance generation and ignore the perception of intrinsic properties, we present OmniX, a versatile and unified framework. Based on a lightweight and efficient cross-modal adapter structure, OmniX reuses 2D generative priors for a broad range of panoramic vision tasks, including panoramic perception, generation, and completion. Furthermore, we construct a large-scale synthetic panorama dataset containing high-quality multimodal panoramas from diverse indoor and outdoor scenes. Extensive experiments demonstrate the effectiveness of our model in panoramic visual perception and graphics-ready 3D scene generation, opening new possibilities for immersive and physically realistic virtual world generation.
[87] Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
Ziyu Guo,Xinyan Chen,Renrui Zhang,Ruichuan An,Yu Qi,Dongzhi Jiang,Xiangtai Li,Manyuan Zhang,Hongsheng Li,Pheng-Ann Heng
Main category: cs.CV
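TL;DR: Using the MME-CoF benchmark, this paper comprehensively evaluates the zero-shot reasoning ability of video models such as Veo-3. It finds that they perform well on short-horizon spatial coherence and local dynamics but remain limited in long-horizon causal reasoning and abstract logic.
Details
Motivation: To investigate whether video generation models can act as zero-shot reasoners, especially in challenging visual reasoning tasks, and thereby reveal their potential and limitations in real-world applications.
Contribution: 1) The MME-CoF benchmark for systematically evaluating Chain-of-Frame (CoF) reasoning in video models; 2) An in-depth analysis of Veo-3 across 12 dimensions that characterizes its strengths and limitations.
Method: Using the MME-CoF benchmark, systematically evaluates Veo-3 across reasoning dimensions such as spatial, geometric, physical, temporal, and logical reasoning.
Result: Video models perform well on short-horizon spatial coherence and local dynamics but poorly on long-horizon causal reasoning, strict geometric constraints, and abstract logic.
Insight: Video models are not yet reliable as standalone zero-shot reasoners, but they show promise as complementary visual engines alongside dedicated reasoning models.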
Abstract: Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet, an important question still remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on the leading and popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
cs.CR
[88] SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
Kaiwen Zhou,Ahmed Elgohary,A S M Iftekhar,Amin Saied
Main category: cs.CR
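TL;DR: SIRAJ is a generic red-teaming framework for evaluating the safety of black-box LLM agents. It uses a dynamic two-step process: generating diverse seed test cases, then iteratively refining model-based adversarial attacks. A distillation approach additionally trains smaller models that cost less yet are equally effective. Experiments show strong results in both risk coverage and attack success rate.
Details
Motivation: The tool-calling abilities of LLM agents introduce new safety risks, so a comprehensive red-teaming system is needed to discover vulnerabilities and ensure safe deployment.
Contribution: 1. The SIRAJ framework, which supports diverse red-teaming; 2. Efficient small models trained by distilling structured reasoning; 3. Substantial gains in risk coverage and attack success rate.
Method: 1. A dynamic two-step process that generates diverse seed test cases and then iteratively refines attacks; 2. Model distillation to train a smaller red-teamer model; 3. Use of execution trajectories together with structured reasoning.
Result: Seed test case generation yields a 2-2.5x boost in the coverage of risk outcomes and tool-calling trajectories; the distilled 8B model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model.
Insight: Combining structured reasoning with iterative refinement substantially improves the efficiency and effectiveness of red-teaming, and small distilled models can match large models at lower cost.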
Abstract: The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases that cover various risk outcomes, tool-use trajectories, and risk sources. Then, it iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of former attempts. To optimize the red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model’s reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields a 2-2.5x boost to the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
cs.DB
[89] Rethinking Text-to-SQL: Dynamic Multi-turn SQL Interaction for Real-world Database Exploration
Linzhuang Sun,Tianyu Guo,Hao Liang,Yuying Li,Qifeng Cai,Jingxuan Wei,Bihui Yu,Wentao Zhang,Bin Cui
Main category: cs.DB
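TL;DR: Targeting dynamic multi-turn SQL interaction in real-world settings, this paper proposes the DySQL-Bench benchmark and a multi-turn evaluation framework, revealing the limitations of current Text-to-SQL models when user intent evolves.
Details
Motivation: Traditional Text-to-SQL models perform well on static, single-turn tasks, but in real scenarios user intent changes dynamically and query constraints or dimensions must be adjusted over multiple turns. Existing models do not meet this need.
Contribution: 1. DySQL-Bench, a dynamic multi-turn SQL interaction benchmark built through automated task generation and expert validation; 2. A multi-turn evaluation framework that simulates realistic interaction; 3. Results showing that current models such as GPT-4o perform poorly on this task.
Method: 1. Generate tasks with an LLM guided by structured tree representations; 2. Ensure data quality through interaction-oriented filtering and expert validation; 3. Design a multi-turn evaluation framework that simulates interaction among a user, the model, and a database.
Result: DySQL-Bench covers 13 domains and contains 1,072 tasks; GPT-4o reaches only 58.34% overall accuracy and 23.81% on the Pass@5 metric.
Insight: Dynamic multi-turn SQL interaction in the real world is highly challenging, and existing Text-to-SQL models need further improvement to adapt to changing intent.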
Abstract: Recent advances in Text-to-SQL have achieved strong results in static, single-turn tasks, where models generate SQL queries from natural language questions. However, these systems fall short in real-world interactive scenarios, where user intents evolve and queries must be refined over multiple turns. In applications such as finance and business analytics, users iteratively adjust query constraints or dimensions based on intermediate results. To evaluate such dynamic capabilities, we introduce DySQL-Bench, a benchmark assessing model performance under evolving user interactions. Unlike previous manually curated datasets, DySQL-Bench is built through an automated two-stage pipeline of task synthesis and verification. Structured tree representations derived from raw database tables guide LLM-based task generation, followed by interaction-oriented filtering and expert validation. Human evaluation confirms 100% correctness of the synthesized data. We further propose a multi-turn evaluation framework simulating realistic interactions among an LLM-simulated user, the model under test, and an executable database. The model must adapt its reasoning and SQL generation as user intents change. DySQL-Bench covers 13 domains across BIRD and Spider 2 databases, totaling 1,072 tasks. Even GPT-4o attains only 58.34% overall accuracy and 23.81% on the Pass@5 metric, underscoring the benchmark’s difficulty. All code and data are released at https://github.com/Aurora-slz/Real-World-SQL-Bench
cs.SE
[90] SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning
Fang Liu,Simiao Liu,Yinghao Zhu,Xiaoli Lian,Li Zhang
Main category: cs.SE
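TL;DR: SecureReviewer strengthens large language models for secure code review through secure-aware fine-tuning. It addresses the weakness of existing methods on security-related issues and introduces a new dataset, RAG-based grounding, and the SecureBLEU evaluation metric.
Details
Motivation: Existing LLM-based code review methods mainly target general-purpose review and are weak at identifying and resolving security-related issues; they also face data scarcity and inadequate evaluation metrics. SecureReviewer aims to fill this gap.
Contribution: 1) A dataset tailored to secure code review; 2) A secure-aware fine-tuning strategy; 3) Integration of RAG to reduce hallucination; 4) The SecureBLEU evaluation metric.
Method: Fine-tunes LLMs with the secure-aware strategy to generate code review comments, uses RAG to ground comments in domain-specific security knowledge, and trains and evaluates on the new dataset.
Result: Experiments show that SecureReviewer outperforms existing baselines in both security issue detection accuracy and review comment quality.
Insight: Secure code review needs domain-specific data and knowledge; combining a dedicated evaluation metric with reliability-enhancing techniques such as RAG can substantially improve model performance.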
Abstract: Identifying and addressing security issues during the early phase of the development lifecycle is critical for mitigating the long-term negative impacts on software systems. Code review serves as an effective practice that enables developers to check their teammates’ code before integration into the codebase. To streamline the generation of review comments, various automated code review approaches have been proposed, where LLM-based methods have significantly advanced the capabilities of automated review generation. However, existing models primarily focus on general-purpose code review, and their effectiveness in identifying and addressing security-related issues remains underexplored. Moreover, adapting existing code review approaches to target security issues faces substantial challenges, including data scarcity and inadequate evaluation metrics. To address these limitations, we propose SecureReviewer, a new approach designed for enhancing LLMs’ ability to identify and resolve security-related issues during code review. Specifically, we first construct a dataset tailored for training and evaluating secure code review capabilities. Leveraging this dataset, we fine-tune LLMs to generate code review comments that can effectively identify security issues and provide fix suggestions with our proposed secure-aware fine-tuning strategy. To mitigate hallucination in LLMs and enhance the reliability of their outputs, we integrate the RAG technique, which grounds the generated comments in domain-specific security knowledge. Additionally, we introduce SecureBLEU, a new evaluation metric designed to assess the effectiveness of review comments in addressing security issues. Experimental results demonstrate that SecureReviewer outperforms state-of-the-art baselines in both security issue detection accuracy and the overall quality and practical utility of generated review comments.
cs.AI
[91] Through the Judge’s Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Xingjian Zhang,Tianhong Gao,Suliang Jin,Tianhao Wang,Teng Ye,Eytan Adar,Qiaozhu Mei
Main category: cs.AI
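TL;DR: This paper proposes a human-LLM collaborative framework that improves the reliability of LLM raters on subjective tasks by inferring thinking traces. The approach combines rejection sampling with downstream uses of the inferred traces and significantly improves LLM-human agreement.
Details
Motivation: The reliability of LLMs on subjective evaluation tasks is limited when human judgments involve subtle reasoning beyond the annotation labels. Thinking traces could close this gap, but collecting them directly is costly and challenging.
Contribution: 1. A human-LLM collaborative framework for inferring thinking traces; 2. A rejection sampling method that reconstructs thinking traces at scale; 3. Use of the traces to improve LLM raters and annotation guidelines.
Method: Infers thinking traces from label-only annotations by rejection sampling, then uses them to fine-tune open LLM raters and to synthesize clearer annotation guidelines.
Result: Experiments show that the approach significantly improves LLM-human agreement and also increases agreement among different LLM models.
Insight: LLMs can serve as practical proxies for unrevealed human thinking traces, so label-only corpora can be extended into trace-augmented resources that make LLM raters more reliable.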
Abstract: Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
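The rejection sampling step can be sketched in a few lines. This is a hedged illustration of the general idea only: `propose` is a hypothetical stand-in for an LLM call that returns a candidate thinking trace together with the label that trace leads to, and the acceptance rule (exact match with the human label) is an assumption rather than the paper's documented criterion.

```python
def infer_thinking_traces(item, human_label, propose, max_tries=8):
    """Keep only sampled traces whose predicted label matches the human label.

    propose(item) -> (trace, predicted_label) is a hypothetical sampler, for
    example a wrapper around an LLM call. Traces that contradict the human
    annotation are rejected.
    """
    accepted = []
    for _ in range(max_tries):
        trace, predicted_label = propose(item)
        if predicted_label == human_label:
            accepted.append(trace)
    return accepted


# Usage with a deterministic stub in place of an LLM:
samples = iter([("clear and polite", "good"), ("too short", "bad")])
kept = infer_thinking_traces("some response", "good", lambda _: next(samples), max_tries=2)
```

The accepted traces are consistent with the human judgment by construction, which is what makes them usable as fine-tuning data or as material for clearer annotation guidelines.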
[92] Approximating Human Preferences Using a Multi-Judge Learned System
Eitán Sprejer,Fernando Avalos,Augusto Bernardi,Jose Pedro Brito de Azevedo Faustino,Jacob Haimes,Narmeen Fatimah Oozeer
Main category: cs.AI
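TL;DR: This paper proposes a framework that models diverse, persona-based preferences by learning to aggregate the outputs of multiple rubric-conditioned judges, addressing the challenge of aligning LLM judges with human preferences.
Details
Motivation: LLM judges are hard to calibrate and suffer from rubric sensitivity, bias, and instability, which undermines their reliability in key applications such as RLHF reward models and routing systems.
Contribution: 1) A persona-based method for synthesizing preference labels at scale; 2) Two aggregator implementations, a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
Method: A framework that models preferences by learning to aggregate the outputs of multiple rubric-conditioned judges, with GAM and MLP implementations compared.
Result: The approach is compared against naive baselines, and its robustness is assessed through case studies on human and LLM judge biases.
Insight: A learned multi-judge system can model the diversity of human preferences, and aggregation can reduce the bias and instability of any single judge.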
Abstract: Aligning LLM-based judges with human preferences is a significant challenge, as they are difficult to calibrate and often suffer from rubric sensitivity, bias, and instability. Overcoming this challenge advances key applications, such as creating reliable reward models for Reinforcement Learning from Human Feedback (RLHF) and building effective routing systems that select the best-suited model for a given user query. In this work, we propose a framework for modeling diverse, persona-based preferences by learning to aggregate outputs from multiple rubric-conditioned judges. We investigate the performance of this approach against naive baselines and assess its robustness through case studies on both human and LLM judge biases. Our primary contributions include a persona-based method for synthesizing preference labels at scale and two distinct implementations of our aggregator: a Generalized Additive Model (GAM) and a Multi-Layer Perceptron (MLP).
[93] Reasoning Curriculum: Bootstrapping Broad LLM Reasoning from Math
Bo Pang,Deqian Kong,Silvio Savarese,Caiming Xiong,Yingbo Zhou
Main category: cs.AI
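TL;DR: The paper proposes Reasoning Curriculum, a two-stage curriculum that first trains the reasoning ability of large language models on math and then transfers it to other domains to obtain general reasoning.
Details
Motivation: Reinforcement learning for large language models currently concentrates on math and code, which limits general reasoning ability. The authors aim to improve general reasoning through cross-domain transfer.
Contribution: A simple, general two-stage curriculum (Reasoning Curriculum) that improves reasoning across domains without requiring specialized reward models.
Method: 1. Stage 1: train reasoning skills on math with verifiable rewards; 2. Stage 2: run joint reinforcement learning on mixed-domain data to transfer and consolidate those skills.
Result: In multi-domain evaluations on Qwen3-4B and Llama-3.1-8B, the method yields consistent gains.
Insight: Math-first training elicits cognitive behaviors that are important for solving complex problems, and both stages are necessary.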
Abstract: Reinforcement learning (RL) can elicit strong reasoning in large language models (LLMs), yet most open efforts focus on math and code. We propose Reasoning Curriculum, a simple two-stage curriculum that first elicits reasoning skills in pretraining-aligned domains such as math, then adapts and refines these skills across other domains via joint RL. Stage 1 performs a brief cold start and then math-only RL with verifiable rewards to develop reasoning skills. Stage 2 runs joint RL on mixed-domain data to transfer and consolidate these skills. The curriculum is minimal and backbone-agnostic, requiring no specialized reward models beyond standard verifiability checks. Evaluated on Qwen3-4B and Llama-3.1-8B over a multi-domain suite, reasoning curriculum yields consistent gains. Ablations and a cognitive-skill analysis indicate that both stages are necessary and that math-first elicitation increases cognitive behaviors important for solving complex problems. Reasoning Curriculum provides a compact, easy-to-adopt recipe for general reasoning.
[94] One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning
Renhao Li,Jianhong Tu,Yang Su,Hamid Alinejad-Rokny,Derek F. Wong,Junyang Lin,Min Yang
Main category: cs.AI
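TL;DR: The paper proposes ToolRM, a family of lightweight generative reward models for tool-use tasks, and improves task performance through a new data construction pipeline and evaluation benchmark.
Details
Motivation: Tool learning lacks reward models designed specifically for function-calling tasks, which limits progress toward more capable agentic AI. The authors propose ToolRM to fill this gap.
Contribution: 1. The ToolRM family, focused on tool-use tasks; 2. The ToolPref-Pairwise-30K dataset; 3. The TRBench$_{BFCL}$ evaluation benchmark; 4. Evidence that ToolRM generalizes to broader critique tasks.
Method: Constructs pairwise preference data using rule-based scoring and multidimensional sampling, then trains the lightweight generative reward models on it.
Result: ToolRM models achieve up to 14.28% higher accuracy in pairwise reward judgments, outperforming frontier models such as Claude 4 and OpenAI o3, and generalize to other tasks such as Best-of-N sampling and self-correction.
Insight: Dedicated reward models matter for tool learning, and a lightweight design with strong generalization makes them more efficient in practice.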
Abstract: Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
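Building pairwise preference data from rule-based scores can be sketched as below. The scoring rule here (a point for the correct function name plus a point per exactly matching argument) is an illustrative assumption, as are the names `rule_score` and `build_pairs` and the dict layout of a tool call; the paper's actual rules and its multidimensional sampling are not reproduced.

```python
from itertools import combinations


def rule_score(call, reference):
    """Score a candidate tool call against a reference call (assumed rule)."""
    if call["name"] != reference["name"]:
        return 0
    matched = sum(
        1
        for key, value in reference["arguments"].items()
        if call["arguments"].get(key) == value
    )
    return 1 + matched


def build_pairs(candidates, reference):
    """Turn scored candidates into (chosen, rejected) pairs, skipping ties."""
    scored = [(rule_score(c, reference), c) for c in candidates]
    pairs = []
    for (score_a, a), (score_b, b) in combinations(scored, 2):
        if score_a > score_b:
            pairs.append({"chosen": a, "rejected": b})
        elif score_b > score_a:
            pairs.append({"chosen": b, "rejected": a})
    return pairs


reference = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}}
candidates = [
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}},
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "F"}},
    {"name": "get_time", "arguments": {"city": "Paris"}},
]
pairs = build_pairs(candidates, reference)
```

Because the scores come from deterministic rules, each preference label can be checked mechanically, which is one way to read the "verifiable feedback" mentioned in the abstract.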
[95] Normative Reasoning in Large Language Models: A Comparative Benchmark from Logical and Modal Perspectives
Kentaro Ozeki,Risako Ando,Takanobu Morishita,Hirohiko Abe,Koji Mineshima,Mitsuhiro Okada
Main category: cs.AI
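TL;DR: This paper systematically evaluates the normative reasoning ability of large language models (LLMs), comparing how they handle normative and epistemic modals from logical and modal perspectives. LLMs generally follow valid reasoning patterns, but they are inconsistent on specific types of normative reasoning and show cognitive biases similar to those found in psychological studies of human reasoning.
Details
Motivation: Normative reasoning involves modalities such as obligation and permission, and LLMs' ability in this area remains underexplored. The authors aim to fill this gap with a comprehensive evaluation from logical and modal perspectives.
Contribution: 1. A new dataset covering a wide range of reasoning patterns in the normative and epistemic domains, including non-formal cognitive factors that influence human reasoning; 2. Evidence of inconsistencies and cognitive biases in specific types of normative reasoning; 3. Insights for improving LLM reliability.
Method: 1. Build a dataset of normative and epistemic reasoning patterns; 2. Compare LLM performance on normative modals with performance on epistemic modals; 3. Assess reasoning consistency from logical and modal perspectives.
Result: LLMs generally adhere to valid reasoning patterns but are inconsistent on specific types of normative reasoning and display human-like cognitive biases, indicating that logical consistency remains a challenge.
Insight: The study highlights the limits of LLMs in normative reasoning and points to directions for improving their reliability and logical consistency.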
Abstract: Normative reasoning is a type of reasoning that involves normative or deontic modality, such as obligation and permission. While large language models (LLMs) have demonstrated remarkable performance across various reasoning tasks, their ability to handle normative reasoning remains underexplored. In this paper, we systematically evaluate LLMs’ reasoning capabilities in the normative domain from both logical and modal perspectives. Specifically, to assess how well LLMs reason with normative modals, we make a comparison between their reasoning with normative modals and their reasoning with epistemic modals, which share a common formal structure. To this end, we introduce a new dataset covering a wide range of formal patterns of reasoning in both normative and epistemic domains, while also incorporating non-formal cognitive factors that influence human reasoning. Our results indicate that, although LLMs generally adhere to valid reasoning patterns, they exhibit notable inconsistencies in specific types of normative reasoning and display cognitive biases similar to those observed in psychological studies of human reasoning. These findings highlight challenges in achieving logical consistency in LLMs’ normative reasoning and provide insights for enhancing their reliability. All data and code are released publicly at https://github.com/kmineshima/NeuBAROCO.
[96] The Era of Agentic Organization: Learning to Organize with Language Models
Zewen Chi,Li Dong,Qingxiu Dong,Yaru Hao,Xun Wu,Shaohan Huang,Furu Wei
Main category: cs.AI
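TL;DR: The paper proposes asynchronous thinking (AsyncThink), a new reasoning paradigm in which language models collaborate as multiple agents to solve complex problems. The method dynamically assigns sub-tasks, merges intermediate knowledge, and optimizes the thinking structure with reinforcement learning; experiments show lower inference latency and higher accuracy on mathematical reasoning.
Details
Motivation: The authors want to go beyond the limits of a single agent and solve complex problems through collaboration among multiple agents, opening an era of "agentic organization."
Contribution: The asynchronous thinking paradigm (AsyncThink), which gives language models the ability to assign tasks dynamically and merge knowledge, with the collaboration structure optimized through reinforcement learning.
Method: A protocol in which an organizer dynamically assigns sub-queries to workers and merges intermediate results into a coherent solution, with the thinking structure optimized through reinforcement learning.
Result: Experiments show that AsyncThink achieves 28% lower inference latency than parallel thinking, improves accuracy on mathematical reasoning, and generalizes to unseen tasks.
Insight: Asynchronous collaboration can improve efficiency and also strengthen generalization, suggesting a direction for future multi-agent system design.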
Abstract: We envision a new era of AI, termed agentic organization, where agents solve complex problems by working collaboratively and concurrently, enabling outcomes beyond individual intelligence. To realize this vision, we introduce asynchronous thinking (AsyncThink) as a new paradigm of reasoning with large language models, which organizes the internal thinking process into concurrently executable structures. Specifically, we propose a thinking protocol where an organizer dynamically assigns sub-queries to workers, merges intermediate knowledge, and produces coherent solutions. More importantly, the thinking structure in this protocol can be further optimized through reinforcement learning. Experiments demonstrate that AsyncThink achieves 28% lower inference latency compared to parallel thinking while improving accuracy on mathematical reasoning. Moreover, AsyncThink generalizes its learned asynchronous thinking capabilities, effectively tackling unseen tasks without additional training.
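The organizer and worker roles in the thinking protocol can be illustrated with a small `asyncio` sketch. It shows only a fixed fork-and-join pattern: the organizer splits a query, workers run concurrently, and the results are merged. In the paper the structure is produced by the model and optimized with reinforcement learning, so the names `worker`, `organizer`, `split`, and `merge` are hypothetical stand-ins and the worker body is a placeholder for a language-model call.

```python
import asyncio


async def worker(sub_query):
    """Placeholder worker; a real one would call a language model."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"answer({sub_query})"


async def organizer(query, split, merge):
    """Assign sub-queries to concurrent workers, then merge their results."""
    sub_queries = split(query)
    # gather runs the workers concurrently and preserves input order.
    results = await asyncio.gather(*(worker(q) for q in sub_queries))
    return merge(results)


final = asyncio.run(organizer("a+b", lambda q: q.split("+"), " | ".join))
```

A learned organizer would decide at run time when to fork, how many workers to use, and when to join, instead of following the single split-then-merge step hard-coded here.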
[97] Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
J. de Curtò,I. de Zarzà,Pablo García,Jordi Cabot
Main category: cs.AI
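TL;DR: This paper presents a cross-platform evaluation of the reasoning capabilities of foundation models across different infrastructures (HPC supercomputing, cloud platforms, and university clusters), using three experimental phases to check the reproducibility and generalization of the methodology.
Details
Motivation: Evaluations of foundation model reasoning are usually confined to a single platform and lack cross-platform comparison. This paper aims to establish an infrastructure-agnostic benchmark for assessing model performance across computational paradigms.
Contribution: 1. An infrastructure-agnostic benchmark spanning three computational paradigms (HPC supercomputing, cloud platforms, university clusters); 2. An evaluation of 15 foundation models on 79 problems across academic domains; 3. The finding that training data quality matters more than model size.
Method: 1. Baseline establishment: evaluate six models on 19 problems on the HPC platform (MareNostrum 5) to set the methodology and reference performance; 2. Infrastructure validation: repeat the 19-problem benchmark on the university cluster and the cloud platform (Nebius AI Studio) to confirm reproducibility; 3. Extended evaluation: run the full 79-problem assessment on both the university cluster and the cloud platform to probe generalization.
Result: The findings challenge conventional scaling assumptions, indicate that training data quality has a larger effect on reasoning ability than model size, and provide model selection guidelines for educational, production, and research contexts.
Insight: Training data quality affects reasoning ability more than model size, and the benchmark results reproduce across the infrastructures tested.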
Abstract: This paper presents a comprehensive cross-platform evaluation of reasoning capabilities in contemporary foundation models, establishing an infrastructure-agnostic benchmark across three computational paradigms: HPC supercomputing (MareNostrum 5), cloud platforms (Nebius AI Studio), and university clusters (a node with eight H200 GPUs). We evaluate 15 foundation models across 79 problems spanning eight academic domains (Physics, Mathematics, Chemistry, Economics, Biology, Statistics, Calculus, and Optimization) through three experimental phases: (1) Baseline establishment: Six models (Mixtral-8x7B, Phi-3, LLaMA 3.1-8B, Gemma-2-9b, Mistral-7B, OLMo-7B) evaluated on 19 problems using MareNostrum 5, establishing methodology and reference performance; (2) Infrastructure validation: The 19-problem benchmark repeated on university cluster (seven models including Falcon-Mamba state-space architecture) and Nebius AI Studio (nine state-of-the-art models: Hermes-4 70B/405B, LLaMA 3.1-405B/3.3-70B, Qwen3 30B/235B, DeepSeek-R1, GPT-OSS 20B/120B) to confirm infrastructure-agnostic reproducibility; (3) Extended evaluation: Full 79-problem assessment on both university cluster and Nebius platforms, probing generalization at scale across architectural diversity. The findings challenge conventional scaling assumptions, establish training data quality as more critical than model size, and provide actionable guidelines for model selection across educational, production, and research contexts. The tri-infrastructure methodology and 79-problem benchmark enable longitudinal tracking of reasoning capabilities as foundation models evolve.
cs.LG
[98] MemEIC: A Step Toward Continual and Compositional Knowledge Editing
Jin Seong,Jiyun Park,Wencke Liermann,Hongseok Choi,Yoonji Nam,Hyun Kim,Soojong Lim,Namhoon Lee
Main category: cs.LG
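TL;DR: MemEIC is a new method for continual multimodal knowledge editing. Through cross-modal retrieval and disentangled parameter updates, it supports compositional editing of visual and textual knowledge while preserving earlier edits.
Details
Motivation: Existing knowledge editing techniques mostly target a single modality (vision or language), neglecting the multimodal nature of these models and the need for continual knowledge updates, which leads to suboptimal edits. MemEIC addresses this with compositional editing and continual updating.
Contribution: 1. MemEIC, which supports compositional editing of visual and textual knowledge; 2. A hybrid external-internal editor with a dual external memory for cross-modal evidence retrieval and dual LoRA adapters for disentangled parameter updates; 3. A brain-inspired knowledge connector that supports cross-modal compositional reasoning.
Method: 1. Retrieve cross-modal evidence from a dual external memory; 2. Use dual LoRA adapters to separate parameter updates for the visual and textual modalities; 3. Integrate multimodal information through a selectively activated knowledge connector.
Result: Experiments show that MemEIC significantly improves performance on complex multimodal questions and effectively preserves prior edits, setting a new benchmark for continual and compositional knowledge editing in LVLMs.
Insight: Compositional editing of multimodal knowledge must account for interaction between modalities, and continual updating must balance new knowledge against retention of old knowledge; MemEIC's dual design and brain-inspired connector address both.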
Abstract: The dynamic nature of information necessitates continuously updating large vision-language models (LVLMs). While recent knowledge editing techniques hint at promising directions, they often focus on editing a single modality (vision or language) in isolation. This prevalent practice neglects the inherent multimodality of LVLMs and the continuous nature of knowledge updates, potentially leading to suboptimal editing outcomes when considering the interplay between modalities and the need for ongoing knowledge refinement. To address these limitations, we propose MemEIC, a novel method for Continual and Compositional Knowledge Editing (CCKE) in LVLMs. MemEIC enables compositional editing of both visual and textual knowledge sequentially. Our approach employs a hybrid external-internal editor featuring a dual external memory for cross-modal evidence retrieval and dual LoRA adapters that facilitate disentangled parameter updates for each modality. A key component is a brain-inspired knowledge connector, activated selectively for compositional reasoning, that integrates information across different modalities. Experiments demonstrate that MemEIC significantly improves performance on complex multimodal questions and effectively preserves prior edits, setting a new benchmark for CCKE in LVLMs.
[99] Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start
Kun Chen,Peng Shi,Haibo Qiu,Zhixiong Zeng,Siqi Yang,Wenji Mao,Lin Ma
Main category: cs.LG
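TL;DR: SPECS is a framework that decouples multimodal learning through a self-distilled, preference-based cold start. It outperforms conventional SFT-based methods and improves both generalization and the performance ceiling.
Details
Motivation: Conventional cold-start methods based on supervised fine-tuning (SFT) suffer from instruction-style overfitting and weak generalization, which hurts downstream reinforcement learning (RL). Contribution: Proposes the SPECS framework, which generates preference data pairs via self-distillation and decouples multimodal learning by focusing the cold start on shallow, transferable surface-form criteria.
Method: 1. Generate preference data pairs via self-distillation; 2. Use preference-based training to learn format and style; 3. Hand off to RL for deep reasoning.
Result: Strong results on multimodal benchmarks, improving MEGA-Bench by 4.1% and MathVista by 12.2%, along with better training stability and exploration.
Insight: Preference-based training is better suited to the cold start than SFT, and decoupled learning improves generalization and raises the performance ceiling.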
Abstract: Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of “MLLM-r1” approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts the reasoning paradigm intertwined with task solution and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately affect downstream RL. We revisit the cold start along two views, its training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify the generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: it (1) generates introspective preference data pairs via self-distillation, avoiding reliance on larger teachers or manual annotation; (2) performs preference-based training that focuses on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off to RL with verifiable rewards for deep reasoning results. Experimental results across multiple multimodal benchmarks show that our decoupling learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution “stuckness,” improving exploration, stabilizing training, and raising the performance ceiling.
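The abstract names DPO as an example of preference-based training. As a rough illustration of what such an objective looks like, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not the SPECS training code: the function name `dpo_loss`, the default `beta`, and the example log-probabilities are assumptions chosen for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All four inputs are summed log-probabilities of a full response, under
    the policy being trained (logp_*) and a frozen reference model (ref_logp_*).
    The beta default is an illustrative assumption, not a value from the paper.
    """
    # How much more the policy prefers each response than the reference does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy already favors the chosen response: the loss drops below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss depends only on how the policy's preference between the two responses differs from the reference model's preference, which is one way to see why such training can emphasize relative qualities of a response (such as format and style) rather than reproducing a target string token by token.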
[100] CorVS: Person Identification via Video Trajectory-Sensor Correspondence in a Real-World Warehouse
Kazuma Kano,Yuki Mori,Shin Katayama,Kenta Urano,Takuro Yonezawa,Nobuo Kawaguchi
Main category: cs.LG
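TL;DR: Proposes CorVS, a person identification method that exploits the correspondence between visual trajectories and sensor data and is suited to real-world warehouse environments.
Details
Motivation: In industrial settings, worker location data is key to improving efficiency, but identifying individuals from visual data alone is impractical, and existing methods may break down under real-world conditions. Contribution: Proposes CorVS, which uses a deep learning model to predict the correspondence between trajectories and sensor data and pairs it with a matching algorithm that operates over time.
Method: 1. A deep learning model predicts the correspondence probability and reliability for each pair of trajectory and sensor data; 2. Trajectories and sensor data are matched dynamically based on those probabilities and reliabilities.
Result: The effectiveness of the method was validated on a dataset of real warehouse operations.
Insight: Combining visual trajectories with sensor data overcomes the limits of a single modality and makes person identification more robust in real-world settings.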
Abstract: Worker location data is key to higher productivity in industrial sites. Cameras are a promising tool for localization in logistics warehouses since they also offer valuable environmental contexts such as package status. However, identifying individuals with only visual data is often impractical. Accordingly, several prior studies identified people in videos by comparing their trajectories and wearable sensor measurements. While this approach has advantages such as independence from appearance, the existing methods may break down under real-world conditions. To overcome this challenge, we propose CorVS, a novel data-driven person identification method based on correspondence between visual tracking trajectories and sensor measurements. Firstly, our deep learning model predicts correspondence probabilities and reliabilities for every pair of a trajectory and sensor measurements. Secondly, our algorithm matches the trajectories and sensor measurements over time using the predicted probabilities and reliabilities. We developed a dataset with actual warehouse operations and demonstrated the method’s effectiveness for real-world applications.
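The abstract describes two stages: a model that scores every trajectory-sensor pair with a probability and a reliability, and an algorithm that matches the two sets using those scores. The matching algorithm itself is not specified here, so the sketch below swaps in a simple greedy one-to-one assignment for a single time window, ranking pairs by probability times reliability. The function name `greedy_match`, the score format, and the ranking rule are all assumptions for illustration.

```python
def greedy_match(scores):
    """Greedy one-to-one matching of trajectories to sensors.

    scores maps (trajectory_id, sensor_id) -> (probability, reliability).
    Ranking by probability * reliability is an illustrative assumption,
    not the rule used by CorVS.
    """
    # Consider the most trustworthy, most likely pairs first.
    ranked = sorted(
        scores.items(),
        key=lambda item: item[1][0] * item[1][1],
        reverse=True,
    )
    used_trajectories, used_sensors = set(), set()
    matches = {}
    for (trajectory_id, sensor_id), _ in ranked:
        # Each trajectory and each sensor may be assigned at most once.
        if trajectory_id in used_trajectories or sensor_id in used_sensors:
            continue
        matches[trajectory_id] = sensor_id
        used_trajectories.add(trajectory_id)
        used_sensors.add(sensor_id)
    return matches


if __name__ == "__main__":
    example_scores = {
        ("t1", "s1"): (0.9, 0.8),
        ("t1", "s2"): (0.6, 0.9),
        ("t2", "s1"): (0.7, 0.9),
        ("t2", "s2"): (0.5, 0.5),
    }
    print(greedy_match(example_scores))
```

A greedy pass is easy to read but can be suboptimal; a method that matches over time, as the abstract describes, would also need to carry assignments across windows, which this sketch does not attempt.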
[101] Deep sequence models tend to memorize geometrically; it is unclear why
Shahriar Noroozizadeh,Vaishnavh Nagarajan,Elan Rosenfeld,Sanjiv Kumar
Main category: cs.LG
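TL;DR: This paper examines the geometric nature of memory in sequence modeling, exposes the limits of the conventional associative-memory view, and puts forward a geometric view of how memory is stored.
Details
Motivation: Conventional accounts abstract memory in sequence models as co-occurrence associations between entities, but the authors find that this cannot explain how a model establishes global relationships between entities that never co-occur, so memory storage needs to be reconsidered from a geometric perspective. Contribution: The main contribution is to reveal the geometric nature of neural network memory and to show that the model synthesizes global geometric relationships on its own rather than relying only on the local co-occurrences seen during training.
Method: The authors analyze a clean, analyzable instance of Transformer reasoning, contrast associative memory with geometric memory, and use a connection to Node2Vec to trace the geometry to a spectral bias.
Result: Even without explicit optimization pressure, the model naturally learns an elegant geometric structure that simplifies a hard reasoning task.
Insight: The paper offers a new view of neural network memory, suggesting that geometry may be a key factor in model success and that it can be exploited in knowledge acquisition and capability optimization.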
Abstract: In sequence modeling, the parametric memory of atomic facts has been predominantly abstracted as a brute-force lookup of co-occurrences between entities. We contrast this associative view against a geometric view of how memory is stored. We begin by isolating a clean and analyzable instance of Transformer reasoning that is incompatible with memory as strictly a storage of the local co-occurrences specified during training. Instead, the model must have somehow synthesized its own geometry of atomic facts, encoding global relationships between all entities, including non-co-occurring ones. This in turn has simplified a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn 1-step geometric task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, despite optimizing over mere local associations, cannot be straightforwardly attributed to typical architectural or optimizational pressures. Counterintuitively, an elegant geometry is learned even when it is not more succinct than a brute-force lookup of associations. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that, in contrast to prevailing theories, indeed arises naturally despite the lack of various pressures. This analysis also points practitioners to visible headroom for making Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery and unlearning.
[102] Remote Labor Index: Measuring AI Automation of Remote Work
Mantas Mazeika,Alice Gatti,Cristina Menghini,Udari Madhushani Sehwag,Shivam Singhal,Yury Orlovskiy,Steven Basart,Manasi Sharma,Denis Peskoff,Elaine Lau,Jaehyuk Lim,Lachlan Carroll,Alice Blair,Vinaya Sivakumar,Sumana Basu,Brad Kenstler,Yuntao Ma,Julian Michael,Xiaoke Li,Oliver Ingebretsen,Aditya Mehta,Jean Mottola,John Teichmann,Kevin Yu,Zaina Shaik,Adam Khoja,Richard Ren,Jason Hausenloy,Long Phan,Ye Htet,Ankit Aich,Tahseen Rabbani,Vivswan Shah,Andriy Novykov,Felix Binder,Kirill Chugunov,Luis Ramirez,Matias Geralnik,Hernán Mesura,Dean Lee,Ed-Yeremai Hernandez Cardona,Annette Diamond,Summer Yue,Alexandr Wang,Bing Liu,Ernesto Hernandez,Dan Hendrycks
Main category: cs.LG
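TL;DR: This paper introduces the Remote Labor Index (RLI), which measures how well AI automates remote work. Current AI agents perform near the floor of the benchmark, with the best agent reaching an automation rate of only 2.5%.
Details
Motivation: AI has progressed rapidly on research-oriented benchmarks, but it remains unclear how that progress translates into economic value and automation. RLI aims to provide a realistic multi-sector benchmark that quantifies AI automation capability on real work. Contribution: Introduces the Remote Labor Index (RLI), a realistic multi-sector benchmark for evaluating end-to-end agent performance, and grounds discussions of AI automation in empirical evidence.
Method: RLI comprises real-world, economically valuable projects and measures how well AI agents automate them end to end.
Result: AI agents perform near the floor on RLI; the highest automation rate is only 2.5%.
Insight: RLI provides an empirical basis for tracking the impact of AI on the labor market and helps stakeholders respond to AI-driven automation.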
Abstract: AIs have made rapid progress on research-oriented benchmarks of knowledge and reasoning, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broadly multi-sector benchmark comprising real-world, economically valuable projects designed to evaluate end-to-end agent performance in practical settings. AI agents perform near the floor on RLI, with the highest-performing agent achieving an automation rate of 2.5%. These results help ground discussions of AI automation in empirical evidence, setting a common basis for tracking AI impacts and enabling stakeholders to proactively navigate AI-driven labor automation.
[103] Defeating the Training-Inference Mismatch via FP16
Penghui Qi,Zichen Liu,Xiangxin Zhou,Tianyu Pang,Chao Du,Wee Sun Lee,Min Lin
Main category: cs.LG
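TL;DR: This paper proposes switching from BF16 to FP16 to resolve the numerical mismatch between the training and inference policies when fine-tuning large language models (LLMs) with reinforcement learning (RL), yielding more stable optimization and better performance.
Details
Motivation: During RL fine-tuning of LLMs, inconsistency between training and inference makes optimization unstable. Existing methods mostly attack the problem from an algorithmic or engineering angle and overlook the underlying effect of floating-point precision. Contribution: Identifies the large rounding errors introduced by BF16 as the root cause of the training-inference mismatch and shows that switching to FP16 resolves it simply and efficiently.
Method: Without changing the model architecture or learning algorithm, use FP16 uniformly for both training and inference to eliminate the mismatch.
Result: Experiments show that FP16 yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms, and frameworks.
Insight: The choice of floating-point precision is critical for RL fine-tuning. FP16 has a narrower dynamic range but effectively reduces rounding error, which keeps training and inference consistent.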
Abstract: Reinforcement learning (RL) fine-tuning of large language models (LLMs) often suffers from instability due to the numerical mismatch between the training and inference policies. While prior work has attempted to mitigate this issue through algorithmic corrections or engineering alignments, we show that its root cause lies in the floating point precision itself. The widely adopted BF16, despite its large dynamic range, introduces large rounding errors that break the consistency between training and inference. In this work, we demonstrate that simply reverting to FP16 effectively eliminates this mismatch. The change is simple, fully supported by modern frameworks with only a few lines of code change, and requires no modification to the model architecture or learning algorithm. Our results suggest that using FP16 uniformly yields more stable optimization, faster convergence, and stronger performance across diverse tasks, algorithms and frameworks. We hope these findings motivate a broader reconsideration of precision trade-offs in RL fine-tuning.
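The rounding-error claim can be checked at the level of the number formats themselves. FP16 keeps 10 explicit mantissa bits while BF16 keeps 7, so within the normal FP16 range a value rounded to BF16 generally lands farther from the original. The sketch below uses only the Python standard library; the helper names `to_fp16` and `to_bf16` and the sample value are assumptions for illustration, and it shows representation error only, not the training-inference mismatch studied in the paper.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision (FP16) and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Round a Python float to bfloat16 and back.

    bfloat16 is the top 16 bits of a float32, so we round the float32 bit
    pattern to the nearest multiple of 2**16 (ties to even). This sketch
    is intended for ordinary finite values only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even on the 16 low bits that bfloat16 discards.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]


if __name__ == "__main__":
    x = 0.1234567  # arbitrary sample value inside the normal FP16 range
    for name, convert in (("fp16", to_fp16), ("bf16", to_bf16)):
        y = convert(x)
        print(f"{name}: value={y!r} abs_error={abs(y - x):.3e}")
```

The trade-off mentioned in the abstract runs the other way for range: BF16 shares the 8-bit exponent of float32, whereas FP16 has a 5-bit exponent and overflows above 65504.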
cs.RO [Back]
[104] DARTS: A Drone-Based AI-Powered Real-Time Traffic Incident Detection System
Bai Li,Achilleas Kourtellis,Rong Cao,Joseph Post,Brian Porter,Yu Zhang
Main category: cs.RO
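TL;DR: DARTS is a drone-based, AI-powered real-time traffic incident detection system. It combines the high mobility of drones with thermal imaging, achieves high detection accuracy with a lightweight deep learning framework, and showed clear advantages in a field test.
Details
Motivation: Conventional traffic incident detection methods suffer from limited flexibility and depend on dense infrastructure or high penetration rates. DARTS aims to provide a more flexible and efficient solution using drones and AI. Contribution: DARTS presents a system that combines drones, thermal imaging, and lightweight deep learning, achieves 99% detection accuracy on a self-collected dataset, and supports real-time visual verification and congestion propagation monitoring.
Method: The system uses drone mobility and thermal imaging, applies a lightweight deep learning framework to extract vehicle trajectories and detect incidents, and supports real-time verification and congestion monitoring through a web-based interface.
Result: In a field test, DARTS detected and verified an incident 12 minutes earlier than the local transportation management center and monitored congestion propagation, showing greater efficiency and adaptability.
Insight: The flexible deployment architecture of DARTS suggests potential scalability and cost-effectiveness in remote areas and resource-constrained settings.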
Abstract: Rapid and reliable incident detection is critical for reducing crash-related fatalities, injuries, and congestion. However, conventional methods, such as closed-circuit television, dashcam footage, and sensor-based detection, separate detection from verification, suffer from limited flexibility, and require dense infrastructure or high penetration rates, restricting adaptability and scalability to shifting incident hotspots. To overcome these challenges, we developed DARTS, a drone-based, AI-powered real-time traffic incident detection system. DARTS integrates drones’ high mobility and aerial perspective for adaptive surveillance, thermal imaging for better low-visibility performance and privacy protection, and a lightweight deep learning framework for real-time vehicle trajectory extraction and incident detection. The system achieved 99% detection accuracy on a self-collected dataset and supports simultaneous online visual verification, severity assessment, and incident-induced congestion propagation monitoring via a web-based interface. In a field test on Interstate 75 in Florida, DARTS detected and verified a rear-end collision 12 minutes earlier than the local transportation management center and monitored incident-induced congestion propagation, suggesting potential to support faster emergency response and enable proactive traffic control to reduce congestion and secondary crash risk. Crucially, DARTS’s flexible deployment architecture reduces dependence on frequent physical patrols, indicating potential scalability and cost-effectiveness for use in remote areas and resource-constrained settings. This study presents a promising step toward a more flexible and integrated real-time traffic incident detection system, with significant implications for the operational efficiency and responsiveness of modern transportation management.
[105] Self-localization on a 3D map by fusing global and local features from a monocular camera
Satoshi Kikuch,Masaya Kato,Tsuyoshi Tasaki
Main category: cs.RO
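TL;DR: This paper proposes a new method combining a CNN with a Vision Transformer for self-localization on a 3D map using a monocular camera. It addresses the poor performance of CNNs in the presence of dynamic obstacles, and experiments show lower localization error than existing methods.
Details
Motivation: Autonomous driving requires accurate self-localization on a 3D map with an inexpensive monocular camera, but a conventional CNN works poorly when dynamic obstacles such as pedestrians are present, so a method that also extracts global features is needed. Contribution: 1. Proposes a new method that combines a CNN with a Vision Transformer to fuse local and global features; 2. Improves localization accuracy in scenes with dynamic obstacles; 3. Reduces error by 20.1% relative to SOTA on public datasets.
Method: A CNN extracts local features (computed from nearby pixels), a Vision Transformer extracts global features (relationships among image patches), and the two are fused to make localization more robust when dynamic obstacles are present.
Result: 1. On the CG dataset, the accuracy improvement rate over SOTA is 1.5 times higher with dynamic obstacles than without; 2. On public datasets, the error is 20.1% smaller than that of SOTA; 3. The robot localizes itself with an average error of 7.51 cm.
Insight: Fusing local and global features makes localization more robust in dynamic environments, and the strength of the Vision Transformer in extracting global features is central to the localization task.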
Abstract: Self-localization on a 3D map using an inexpensive monocular camera is required to realize autonomous driving. Camera-based self-localization often uses a convolutional neural network (CNN), which extracts local features computed from nearby pixels. However, when dynamic obstacles, such as people, are present, the CNN does not work well. This study proposes a new method combining a CNN with a Vision Transformer, which excels at extracting global features that capture the relationships among patches across the whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate in a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself with 7.51cm error on average, which is more accurate than SOTA.
[106] AgriGS-SLAM: Orchard Mapping Across Seasons via Multi-View Gaussian Splatting SLAM
Mirko Usuelli,David Rapado-Rincon,Gert Kootstra,Matteo Matteucci
Main category: cs.RO
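TL;DR: AgriGS-SLAM is a Visual-LiDAR SLAM framework that incorporates multi-camera 3D Gaussian Splatting (3DGS) rendering for real-time 3D scene reconstruction of orchards across seasons. A unified gradient-driven map lifecycle and LiDAR depth-consistency optimization improve reconstruction quality and trajectory stability.
Details
Motivation: Orchards present repetitive geometry, seasonal appearance changes, and wind-driven foliage motion, so autonomous robots need more robust 3D scene understanding. Contribution: 1. A SLAM framework that couples LiDAR odometry with multi-camera 3DGS rendering; 2. A unified gradient-driven map lifecycle; 3. A pose refinement method guided by LiDAR depth consistency; 4. Validation of cross-season performance and real-time operation in real orchards.
Method: 1. Multi-camera 3DGS rendering recovers orchard structure under occlusion; 2. A unified gradient-driven map lifecycle runs between keyframes; 3. A LiDAR depth-consistency term refines camera poses; 4. The system is deployed in real orchards and evaluated with a standardized trajectory protocol.
Result: Across seasons and orchards, AgriGS-SLAM delivers sharper, more stable 3D reconstructions and steadier trajectories than existing 3DGS-SLAM baselines while maintaining real-time performance.
Insight: Combining LiDAR with multi-view vision can improve SLAM robustness and reconstruction quality in complex outdoor environments.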
Abstract: Autonomous robots in orchards require real-time 3D scene understanding despite repetitive row geometry, seasonal appearance changes, and wind-driven foliage motion. We present AgriGS-SLAM, a Visual–LiDAR SLAM framework that couples direct LiDAR odometry and loop closures with multi-camera 3D Gaussian Splatting (3DGS) rendering. Batch rasterization across complementary viewpoints recovers orchard structure under occlusions, while a unified gradient-driven map lifecycle executed between keyframes preserves fine details and bounds memory. Pose refinement is guided by a probabilistic LiDAR-based depth consistency term, back-propagated through the camera projection to tighten geometry-appearance coupling. We deploy the system on a field platform in apple and pear orchards across dormancy, flowering, and harvesting, using a standardized trajectory protocol that evaluates both training-view and novel-view synthesis to reduce 3DGS overfitting in evaluation. Across seasons and sites, AgriGS-SLAM delivers sharper, more stable reconstructions and steadier trajectories than recent state-of-the-art 3DGS-SLAM baselines while maintaining real-time performance on-tractor. While demonstrated in orchard monitoring, the approach can be applied to other outdoor domains requiring robust multimodal perception.
eess.IV [Back]
[107] Groupwise Registration with Physics-Informed Test-Time Adaptation on Multi-parametric Cardiac MRI
Xinqi Li,Yi Zhang,Li-Ting Huang,Hsiao-Huang Chang,Thoralf Niendorf,Min-Chi Ku,Qian Tao,Hsin-Jung Yang
Main category: eess.IV
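TL;DR: Proposes a deep learning model with physics-informed test-time adaptation that addresses misalignment in multi-parametric cardiac MRI registration, using synthetic images as the reference to improve multi-modal registration.
Details
Motivation: Multi-parametric mapping MRI performs well for myocardial tissue characterization, but misalignment between the different parametric images makes pixel-wise analysis difficult, so a generalizable method for registering multi-contrast images is needed. Contribution: 1. Develops a physics-informed deep learning model that supports test-time adaptation; 2. Uses synthetic images generated from a physics model as the registration reference, addressing the registration of multi-contrast images.
Method: 1. A physics model generates synthetic images that serve as the reference; 2. Test-time adaptation enables transductive learning across different tissue contrasts.
Result: The model was validated on a variety of MRI sequences in healthy volunteers and improved multi-modal registration.
Insight: Introducing physics information helps with the contrast differences that complicate multi-modal image registration, and test-time adaptation makes the model more general and flexible.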
Abstract: Multiparametric mapping MRI has become a viable tool for myocardial tissue characterization. However, misalignment between multiparametric maps makes pixel-wise analysis challenging. To address this challenge, we developed a generalizable physics-informed deep-learning model using test-time adaptation to enable groupwise image registration across contrast-weighted images acquired from multiple physical models (e.g., a T1 mapping model and a T2 mapping model). The physics-informed adaptation uses synthetic images generated from the specific physics model as the registration reference, which allows transductive learning across various tissue contrasts. We validated the model in healthy volunteers with various MRI sequences, demonstrating improved multi-modal registration across a wide range of image contrast variability.
[108] BRIQA: Balanced Reweighting in Image Quality Assessment of Pediatric Brain MRI
Alya Almsouti,Ainur Khamitova,Darya Taratynova,Mohammad Yaqub
Main category: eess.IV
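TL;DR: BRIQA proposes a balanced reweighting method for image quality assessment of pediatric brain MRI that addresses class imbalance and improves the average macro F1 score.
Details
Motivation: Manual quality assessment of pediatric brain MRI is time-consuming and subjective, and the problem is sharper in low-field systems, where the signal-to-noise ratio is reduced, so robust automated solutions are needed. Contribution: BRIQA addresses class imbalance with gradient-based loss reweighting and a rotating batching scheme, improving model performance.
Method: Gradient-based loss reweighting dynamically adjusts the contribution of each class, and a rotating batching scheme ensures consistent exposure to underrepresented classes.
Result: BRIQA raises the average macro F1 score from 0.659 to 0.706, with notable gains across several artifact severity classification tasks.
Insight: No single architecture performs best across all artifact types, which points to the value of architectural diversity; the rotating batching scheme combined with cross-entropy loss promotes balanced learning.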
Abstract: Assessing the severity of artifacts in pediatric brain Magnetic Resonance Imaging (MRI) is critical for diagnostic accuracy, especially in low-field systems where the signal-to-noise ratio is reduced. Manual quality assessment is time-consuming and subjective, motivating the need for robust automated solutions. In this work, we propose BRIQA (Balanced Reweighting in Image Quality Assessment), which addresses class imbalance in artifact severity levels. BRIQA uses gradient-based loss reweighting to dynamically adjust per-class contributions and employs a rotating batching scheme to ensure consistent exposure to underrepresented classes. Our experiments show that no single architecture performs best across all artifact types, which underscores the importance of architectural diversity. The rotating batching configuration improves performance across metrics by promoting balanced learning when combined with cross-entropy loss. BRIQA improves average macro F1 score from 0.659 to 0.706, with notable gains in Noise (0.430), Zipper (0.098), Positioning (0.097), Contrast (0.217), Motion (0.022), and Banding (0.012) artifact severity classification. The code is available at https://github.com/BioMedIA-MBZUAI/BRIQA.
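The headline metric here is macro F1, which averages per-class F1 scores with equal weight and so is sensitive to how well rare severity levels are classified. The sketch below is a plain-Python illustration of that metric, not code from the BRIQA repository; the function name `macro_f1` and the toy labels are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        true_pos = sum(t == label and p == label for t, p in pairs)
        false_pos = sum(t != label and p == label for t, p in pairs)
        false_neg = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denominator = 2 * true_pos + false_pos + false_neg
        f1_scores.append(2 * true_pos / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Toy severity labels: class 0 dominates, class 1 is rare and is missed.
    y_true = [0, 0, 0, 0, 1, 2]
    y_pred = [0, 0, 0, 0, 0, 2]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(f"accuracy={accuracy:.3f} macro_f1={macro_f1(y_true, y_pred):.3f}")
```

In the toy example a single missed minority sample leaves accuracy high but pulls macro F1 down sharply, which is why the metric suits imbalanced artifact-severity labels.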
[109] MORE: Multi-Organ Medical Image REconstruction Dataset
Shaokai Wu,Yapan Guo,Yanbiao Ji,Jing Tong,Yuxiang Lu,Mei Li,Suizhi Huang,Yue Ding,Hongtao Lu
Main category: eess.IV
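TL;DR: This paper introduces the MORE dataset for multi-organ medical image reconstruction, covering 9 anatomies and 15 lesion types, with the aim of improving the generalization of deep learning models, and presents a baseline that outperforms prior methods.
Details
Motivation: Existing deep learning methods for CT reconstruction are typically limited to specific anatomies and datasets, so they generalize poorly to unseen anatomies and lesions. Contribution: 1. Introduces the MORE dataset, which covers multiple organs and a range of lesion types; 2. Establishes a strong baseline that outperforms prior methods under challenging conditions.
Method: An optimization-based approach; models are trained on heterogeneous multi-organ data, and effectiveness is validated through extensive experiments.
Result: 1. The breadth of the dataset improves model generalization; 2. The optimization-based method is more robust on unseen anatomies.
Insight: 1. Diverse datasets are essential for medical image reconstruction; 2. Optimization-based methods perform better in complex scenarios.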
Abstract: CT reconstruction provides radiologists with images for diagnosis and treatment, yet current deep learning methods are typically limited to specific anatomies and datasets, hindering generalization ability to unseen anatomies and lesions. To address this, we introduce the Multi-Organ medical image REconstruction (MORE) dataset, comprising CT scans across 9 diverse anatomies with 15 lesion types. This dataset serves two key purposes: (1) enabling robust training of deep learning models on extensive, heterogeneous data, and (2) facilitating rigorous evaluation of model generalization for CT reconstruction. We further establish a strong baseline solution that outperforms prior approaches under these challenging conditions. Our results demonstrate that: (1) a comprehensive dataset helps improve the generalization capability of models, and (2) optimization-based methods offer enhanced robustness for unseen anatomies. The MORE dataset is freely accessible under CC-BY-NC 4.0 at our project page https://more-med.github.io/