Table of Contents
cs.CL
[1] Policy Optimization Prefers The Path of Least Resistance
Debdeep Sanyal,Aakash Sen Sharma,Dhruv Kumar,Saurabh Deshpande,Murari Mandal
Main category: cs.CL
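TL;DR: The paper studies how policy optimization (PO) behaves on multi-step reasoning tasks and finds that PO gravitates to the simplest path (answering directly), even when a more complex path (thinking before answering) carries a higher reward weight.
Details
Motivation: Existing work forces language models into a strict think-then-answer format to elicit chain-of-thought (CoT), but how PO behaves in an open-ended structure once these constraints are relaxed remains under-studied.
Contribution: Identifies a behavioral principle of PO: it consistently follows the path of least resistance. Even when a more complex path offers higher reward, PO still drifts toward simpler behavior.
Method: A series of controlled experiments and reward-decomposition experiments test which reward components PO prefers to optimize.
Result: PO optimizes the simplest reward component first, even under mutually exclusive choices or strong incentives for complex behavior. This behavior also requires sufficient freedom under KL regularization.
Insight: Granting the policy freedom is a double-edged sword: it helps discover high-reward shortcuts, but it can also lead to exploitation of the reward function (reward hacking). This poses a key challenge for alignment.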
Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: \textit{policy optimization consistently follows the path of least resistance}. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct \texttt{
[2] Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks
Avinash Patil
Main category: cs.CL
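TL;DR: The paper proposes the RACE framework for evaluating how well explanations generated by large language models (LLMs) align with feature importance from logistic regression, revealing both the faithfulness and the asymmetry of LLM explanations.
Details
Motivation: As machine learning spreads into sensitive domains, demand for transparent and interpretable AI keeps growing. LLMs can generate natural-language explanations, but whether these truly reflect the predictive signals is unclear.
Contribution: Proposes the RACE framework, which systematically evaluates the alignment between LLM explanations and logistic-regression feature importance, and finds an asymmetry between explanations for correct and incorrect predictions.
Method: On four text classification datasets, token-level matching, exact string matching, and edit-distance matching are used to analyze how well LLM explanations align with feature importance.
Result: Explanations for correct predictions cover more supporting features, while explanations for incorrect predictions are associated with contradicting features. Edit-distance matching further uncovers paraphrastic overlaps.
Insight: LLM explanations reuse surface-level evidence but can also amplify misleading cues in error cases; RACE provides a quantitative basis for evaluating reasoning completeness in neural language models.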
Abstract: The growing adoption of machine learning (ML) in sensitive domains has heightened the demand for transparent and interpretable artificial intelligence. Large Language Models (LLMs) are increasingly capable of producing natural language explanations, yet it remains unclear whether these rationales faithfully capture the predictive signals that underlie decisions. This paper introduces RACE (Reasoning Alignment for Completeness of Explanations), a systematic framework to evaluate the alignment between LLM-generated explanations and interpretable feature importance scores derived from a logistic regression baseline. We analyze four widely used text classification datasets (WIKI ONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS) and compare LLM rationales against top-ranked supporting and contradicting lexical features. To capture alignment at multiple levels of granularity, RACE implements token-aware, exact string, and edit-distance matching techniques. Empirical results reveal a consistent asymmetry: correct predictions exhibit higher coverage of supporting features, while incorrect predictions are associated with elevated coverage of contradicting features. Edit-distance matching further uncovers paraphrastic overlaps, boosting coverage while preserving this asymmetry. These findings demonstrate that LLM rationales combine both surface-level and flexible evidence reuse, yet can also amplify misleading cues in error cases. RACE provides new insights into the faithfulness of LLM explanations and establishes a quantitative basis for evaluating reasoning completeness in neural language models.
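Illustrative sketch (not the RACE implementation): the coverage idea in the abstract can be pictured as the fraction of top-ranked features that show up in a rationale. The function names, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for a normalized edit-distance similarity are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_coverage(rationale, features):
    """Fraction of features that appear verbatim as a rationale token."""
    tokens = set(tokenize(rationale))
    hits = sum(1 for f in features if f.lower() in tokens)
    return hits / len(features) if features else 0.0


def fuzzy_coverage(rationale, features, threshold=0.8):
    """Fraction of features with a near match among rationale tokens.

    SequenceMatcher.ratio() stands in for an edit-distance similarity;
    the threshold is an arbitrary illustrative choice.
    """
    tokens = set(tokenize(rationale))
    hits = 0
    for f in features:
        f = f.lower()
        if any(SequenceMatcher(None, f, t).ratio() >= threshold for t in tokens):
            hits += 1
    return hits / len(features) if features else 0.0
```

Under these assumptions, fuzzy matching can only raise coverage relative to exact matching, which is consistent with the abstract's remark that edit-distance matching boosts coverage.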
[3] Understanding Network Behaviors through Natural Language Question-Answering
Mingzhe Xing,Chang Tian,Jianan Zhang,Lichen Pan,Peipei Liu,Zhaoteng Yan,Yinliang Yue
Main category: cs.CL
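TL;DR: NetMind is a natural-language framework for understanding network behavior. It uses a tree-based configuration chunking strategy and a unified fact graph to address LLM challenges in long-context understanding, device heterogeneity, and complex reasoning, and it outperforms existing methods in experiments.
Details
Motivation: The complexity of modern large-scale networks raises the risk of misconfiguration. Existing methods rely on domain-specific languages and formal models, which have a steep learning curve and limited flexibility. Natural language (NL) offers a more intuitive interface, but LLMs still struggle with long configurations, heterogeneity, and complex reasoning.
Contribution: 1) a tree-based configuration chunking strategy that preserves semantic coherence; 2) a unified fact graph that normalizes vendor-specific configurations; 3) a hybrid imperative-declarative language that reduces the reasoning burden on LLMs; 4) a benchmark dataset of NL question-answer pairs and network configurations.
Method: NetMind works in three steps: 1) tree-based chunking of long configurations; 2) construction of a unified fact graph to handle heterogeneity; 3) a hybrid language design to improve reasoning precision.
Result: Experiments show that NetMind is accurate and scalable on network behavior understanding tasks and outperforms existing baselines.
Insight: Combining a chunking strategy with an intermediate representation (the unified fact graph) is an effective way to strengthen LLMs in complex network scenarios, and the hybrid language design improves precision without sacrificing flexibility.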
Abstract: Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors by mining network configurations, typically relying on domain-specific languages interfaced with formal models. While effective, they suffer from a steep learning curve and limited flexibility. In contrast, natural language (NL) offers a more accessible and interpretable interface, motivating recent research on NL-guided network behavior understanding. Recent advances in large language models (LLMs) further enhance this direction, leveraging their extensive prior knowledge of network concepts and strong reasoning capabilities. However, three key challenges remain: 1) numerous router devices with lengthy configuration files challenge LLM’s long-context understanding ability; 2) heterogeneity across devices and protocols impedes scalability; and 3) complex network topologies and protocols demand advanced reasoning abilities beyond the current capabilities of LLMs. To tackle the above challenges, we propose NetMind, a novel framework for querying networks using NL. Our approach introduces a tree-based configuration chunking strategy to preserve semantic coherence while enabling efficient partitioning. We then construct a unified fact graph as an intermediate representation to normalize vendor-specific configurations. Finally, we design a hybrid imperative-declarative language to reduce the reasoning burden on LLMs and enhance precision. We contribute a benchmark consisting of NL question-answer pairs paired with network configurations. Experiments demonstrate that NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines.
[4] Deep Literature Survey Automation with an Iterative Workflow
Hongbo Zhang,Han Cui,Yidong Wang,Yijian Tian,Qi Guo,Cunxiang Wang,Jian Wu,Chiyu Song,Yue Zhang
Main category: cs.CL
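TL;DR: The paper proposes IterSurvey, an automated literature survey framework built on an iterative workflow; through dynamic outline generation and the integration of multimodal elements, it substantially improves survey quality and readability.
Details
Motivation: Existing automated survey systems mostly follow a paradigm of one-shot retrieval and static outline generation, which leads to noisy retrieval, fragmented structure, and context overload, and hurts survey quality. Inspired by the iterative reading process of human researchers, the paper proposes a new framework to address these problems.
Contribution: (1) the IterSurvey framework, which improves survey quality through dynamic outline generation and incremental retrieval; (2) paper cards and a mechanism for integrating multimodal elements; (3) the Survey-Arena benchmark, which more reliably assesses the quality gap between machine-generated and human-written surveys.
Method: An iterative workflow: (1) a planning agent dynamically retrieves papers and updates the outline; (2) paper cards distill the core content of each paper (contributions, methods, and findings); (3) visualization-enhanced integration of multimodal elements refines the textual flow.
Result: Experiments show that IterSurvey outperforms existing baselines in content coverage, structural coherence, and citation quality, while producing more readable and better-organized surveys. The Survey-Arena benchmark further supports the reliability of these improvements.
Insight: The paper shows the importance of dynamic outline generation and multimodal integration for automated literature surveys, and stresses the key role of benchmarks in assessing quality improvements.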
Abstract: Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static outline is generated before drafting. This design often leads to noisy retrieval, fragmented structures, and context overload, ultimately limiting survey quality. Inspired by the iterative reading process of human researchers, we propose IterSurvey, a framework based on recurrent outline generation, in which a planning agent incrementally retrieves, reads, and updates the outline to ensure both exploration and coherence. To provide faithful paper-level grounding, we design paper cards that distill each paper into its contributions, methods, and findings, and introduce a review-and-refine loop with visualization enhancement to improve textual flow and integrate multimodal elements such as figures and tables. Experiments on both established and emerging topics show that IterSurvey substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, while producing more accessible and better-organized surveys. To provide a more reliable assessment of such improvements, we further introduce Survey-Arena, a pairwise benchmark that complements absolute scoring and more clearly positions machine-generated surveys relative to human-written ones. The code is available at https://github.com/HancCui/IterSurvey_Autosurveyv2.
[5] Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
Yilin Zhang,Wenda Xu,Zhongtao Liu,Tetsuji Nakagawa,Markus Freitag
Main category: cs.CL
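TL;DR: The paper exposes a systematic length bias in quality estimation (QE) metrics for machine translation and proposes two mitigation strategies.
Details
Motivation: QE metrics are widely used in machine translation for reference-free evaluation and as reward signals in reinforcement learning, but their systematic bias with respect to translation length, and its impact, have been underexplored.
Contribution: Finds that QE metrics consistently over-predict errors for longer translations and prefer shorter translations, and proposes two methods to reduce the bias: length normalization and incorporating reference texts.
Method: The paper systematically analyzes top-performing regression-based and LLM-as-a-Judge QE metrics across 10 language pairs, and proposes length normalization during training and the use of reference texts during evaluation to mitigate length bias.
Result: Experiments show that both strategies effectively reduce length bias in QE metrics.
Insight: Length bias in QE metrics can unfairly penalize longer, high-quality translations, which in turn degrades their use in reranking and reinforcement learning.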
Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.
[6] ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre,Sneha Kudugunta,Niklas Muennighoff,I-Hung Hsu,Isaac Caswell,Alex Pentland,Sercan Arik,Chen-Yu Lee,Sayna Ebrahimi
Main category: cs.CL
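TL;DR: Based on 774 multilingual training experiments, the paper proposes the Adaptive Transfer Scaling Law (ATLAS), which clearly outperforms existing scaling laws, and sheds light on multilingual learning dynamics, cross-lingual transfer, and ways to counter the curse of multilinguality.
Details
Motivation: Existing scaling-law research focuses mainly on English and offers little support for multilingual settings, so it fails to serve billions of non-English users worldwide.
Contribution: 1. the ATLAS law, which markedly improves out-of-sample generalization; 2. a 38x38 cross-lingual transfer matrix; 3. a language-agnostic scaling law; 4. the compute crossover points between pretraining from scratch and finetuning.
Method: 774 experiments (10M-8B parameters, 400+ training languages, 48 evaluation languages) are used to analyze multilingual learning dynamics and transfer properties and to derive the adaptive transfer scaling law.
Result: ATLAS generalizes out of sample better than existing methods (often by more than 0.3 in R^2) and reveals optimal strategies for cross-lingual transfer.
Insight: Multilingual models can avoid performance degradation by optimizing model size and data allocation, and the choice between pretraining from scratch and finetuning should rest on a compute-cost trade-off.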
Abstract: Scaling laws research has focused overwhelmingly on English – yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which outperforms existing scaling laws’ out-of-sample generalization often by more than 0.3 R^2. Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual benefit scores between 38 x 38=1444 language pairs. Second, we derive a language-agnostic scaling law that reveals how to optimally scale model size and data when adding languages without sacrificing performance. Third, we identify the computational crossover points for when to pretrain from scratch versus finetune from multilingual checkpoints. We hope these findings provide the scientific foundation for democratizing scaling laws across languages, and enable practitioners to efficiently scale models – beyond English-first AI.
cs.CV
[7] Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
Alexa R. Tartaglini,Satchel Grant,Daniel Wurgaft,Christopher Potts,Judith E. Fan
Main category: cs.CV
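TL;DR: The paper develops FUGU, a task suite for diagnosing bottlenecks in data visualization understanding by vision-language models (VLMs), and finds that errors stem mainly from information transfer between the vision and language modules and that model architectures impose limits.
Details
Motivation: Data visualizations are important components of scientific articles and news stories, yet current VLMs perform poorly on basic tasks. The study aims to pin down the source of errors: visual encoding, information transfer, or language processing.
Contribution: Proposes the FUGU task suite for precisely diagnosing where VLMs struggle in data visualization understanding; using activation patching and linear probes, shows that errors concentrate in the vision-language handoff.
Method: Three widely used VLMs are tested on FUGU; information flow is traced with activation patching and linear probes; correct coordinates are supplied to verify the source of errors.
Result: VLMs often generate wrong coordinates for individual data points, and performance improves substantially once correct coordinates are provided; on tasks involving statistical relationships across many data points, however, providing them makes performance worse; fine-tuning does not reach ceiling performance.
Insight: The architecture of current VLMs limits the reliability of data visualization understanding, with bottlenecks in the vision-language handoff and in handling relationships across many data points.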
Abstract: Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used FUGU to investigate three widely used VLMs. To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through models across a variety of prompting strategies. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially. Moreover, even when the model generates an incorrect response, the correct coordinates can be successfully read out from the latent representations in the vision encoder, suggesting that the source of these errors lies in the vision-language handoff. We further found that while providing correct coordinates helps with tasks involving one or a small number of data points, it generally worsens performance for tasks that require extracting statistical relationships across many data points. Fine-tuning models on FUGU also fails to yield ceiling performance. These findings point to architectural constraints in current VLMs that might pose significant challenges for reliable data visualization understanding.
[8] Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries
Mihir Gupta,Pratik Desai,Ross Greer
Main category: cs.CV
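TL;DR: The paper proposes Agro-Consensus, a low-cost self-consistency framework that uses semantic clustering and a consensus mechanism to make vision-language models (VLMs) more reliable at agricultural image captioning, aimed at crop disease management in developing countries.
Details
Motivation: Crop disease management in developing countries is hampered by scarce expert resources, unreliable connectivity, and high costs, which limit the deployment of existing AI systems. The paper aims to design a low-cost, reliable AI framework that improves the accuracy of agricultural diagnosis.
Contribution: 1) a self-consistency framework based on semantic clustering, which groups candidate responses with a lightweight (80MB) embedding model and selects the best caption; 2) a human-in-the-loop (HITL) mechanism in which user confirmation of the crop type filters out erroneous generations; 3) validation on the PlantVillage dataset, where the framework improves on baseline decoding.
Method: 1) semantically cluster multiple candidate captions with a pre-trained embedding model; 2) select the most coherent caption (covering diagnosis, symptoms, analysis, and recommendations) through a cosine-similarity consensus; 3) use the HITL mechanism to improve input quality. Candidate captions are generated by a 3B-parameter PaliGemma model.
Result: On 800 crop disease images: 1) single-cluster consensus reaches 83.1% accuracy with 10 candidate generations (greedy-decoding baseline: 77.5%); 2) accuracy rises to 94.0% when a correct response in any of the top four clusters counts (top-4 baseline: 88.5%).
Insight: Semantic self-consistency can markedly improve VLM reliability in resource-constrained settings, and the HITL design further reduces error propagation, offering a practical route for agricultural AI in developing countries.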
Abstract: Agricultural disease management in developing countries such as India, Kenya, and Nigeria faces significant challenges due to limited access to expert plant pathologists, unreliable internet connectivity, and cost constraints that hinder the deployment of large-scale AI systems. This work introduces a cost-effective self-consistency framework to improve vision-language model (VLM) reliability for agricultural image captioning. The proposed method employs semantic clustering, using a lightweight (80MB) pre-trained embedding model to group multiple candidate responses. It then selects the most coherent caption – containing a diagnosis, symptoms, analysis, treatment, and prevention recommendations – through a cosine similarity-based consensus. A practical human-in-the-loop (HITL) component is incorporated, wherein user confirmation of the crop type filters erroneous generations, ensuring higher-quality input for the consensus mechanism. Applied to the publicly available PlantVillage dataset using a fine-tuned 3B-parameter PaliGemma model, our framework demonstrates improvements over standard decoding methods. Evaluated on 800 crop disease images with up to 21 generations per image, our single-cluster consensus method achieves a peak accuracy of 83.1% with 10 candidate generations, compared to the 77.5% baseline accuracy of greedy decoding. The framework’s effectiveness is further demonstrated when considering multiple clusters; accuracy rises to 94.0% when a correct response is found within any of the top four candidate clusters, outperforming the 88.5% achieved by a top-4 selection from the baseline.
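Illustrative sketch (a simplification, not the authors' pipeline): given embeddings of the candidate captions, a cosine-similarity consensus could be computed as below. The greedy clustering rule, the 0.8 threshold, and the function names are assumptions for this example; the embeddings are taken as precomputed NumPy arrays rather than produced by any particular embedding model.

```python
import numpy as np


def cosine_matrix(emb):
    """Pairwise cosine similarity between row-vector embeddings."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T


def consensus_index(emb, threshold=0.8):
    """Return the index of a consensus candidate.

    Greedy clustering: each candidate joins the first cluster whose seed
    it resembles (cosine >= threshold), otherwise it starts a new cluster.
    The winner is the member of the largest cluster with the highest mean
    similarity to the members of that cluster.
    """
    sim = cosine_matrix(np.asarray(emb, dtype=float))
    clusters = []  # lists of candidate indices; the first item is the seed
    for i in range(sim.shape[0]):
        for cluster in clusters:
            if sim[i, cluster[0]] >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    largest = max(clusters, key=len)
    mean_sim = sim[np.ix_(largest, largest)].mean(axis=1)
    return largest[int(np.argmax(mean_sim))]
```

In this sketch an outlier caption ends up alone in its own cluster and cannot win, which is the intuition behind preferring the majority cluster over a single greedy decode.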
[9] Proportion and Perspective Control for Flow-Based Image Generation
Julien Boudier,Hugo Caselles-Dupré
Main category: cs.CV
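TL;DR: The paper proposes two ControlNet modules, a proportion ControlNet and a perspective ControlNet, which use bounding boxes and vanishing lines respectively to control the spatial and geometric structure of generated images, improving the controllability of text-to-image models.
Details
Motivation: Modern text-to-image diffusion models generate high-fidelity images but offer limited control over spatial and geometric structure. More precise tools are needed for artistic work.
Contribution: 1. a proportion ControlNet that controls object position and size through bounding boxes; 2. a perspective ControlNet that uses vanishing lines to control the 3D geometry of the scene.
Method: The ControlNet modules are trained with data pipelines that use vision-language models for annotation and specialized algorithms for conditioning image synthesis.
Result: Experiments show that both modules control image generation effectively but remain limited under complex constraints.
Insight: Bounding boxes and geometric constraints can markedly improve the controllability of image generation, though complex scenes still need further work.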
Abstract: While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: https://huggingface.co/obvious-research
[10] H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows
Harry Zhang,Luca Carlone
Main category: cs.CV
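TL;DR: H2OFlow is a framework based on 3D generative models and dense diffused flows for learning 3D affordances of human-object interaction (HOI), covering contact, orientation, and spatial occupancy, without human annotations.
Details
Motivation: Existing methods depend on costly hand-labeled datasets and neglect orientation and spatial occupancy in interactions. H2OFlow aims to address these problems with synthetic data.
Contribution: H2OFlow is presented as the first framework to comprehensively learn 3D HOI affordances; it relies on synthetic data and a dense diffused flow representation, avoiding manual annotation.
Method: Synthetic data is generated with 3D generative models, and a 3D flow representation is learned through a dense diffusion process on point clouds.
Result: Experiments show that H2OFlow performs well on real-world objects and surpasses methods that rely on manual annotations or mesh-based representations.
Insight: Synthetic data and diffused flows can be an effective alternative for learning complex 3D interaction affordances, sidestepping the annotation bottleneck.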
Abstract: Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (e.g., humans might have a preferential orientation with respect to certain objects, such as a TV) and spatial occupancy (e.g., humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce H2OFlow, a novel framework that comprehensively learns 3D HOI affordances – encompassing contact, orientation, and spatial occupancy – using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance.
[11] OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment
Yulong Zhang
Main category: cs.CV
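TL;DR: The paper presents OCR-Quality, a human-annotated dataset for evaluating and developing OCR quality assessment methods. It contains PNG images converted from 1,000 PDF pages, covers a range of real-world scenarios, and provides 4-level quality scores.
Details
Motivation: Quality assessment of OCR in real-world scenarios lacks reliable datasets, which limits the development and improvement of OCR verification systems.
Contribution: Contributes the OCR-Quality dataset, with diverse document samples and human-annotated quality scores, filling a gap in OCR quality assessment.
Method: The dataset is built from 300 DPI PNG images processed with state-of-the-art vision-language models (VLMs) and manually annotated with a 4-level quality scoring system.
Result: OCR-Quality provides a benchmark for OCR quality assessment, is publicly available, and supports the training and evaluation of OCR verification systems.
Insight: The dataset is designed with diversity and practical needs in mind, underlines the importance of human annotation, and provides a basis for further OCR research.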
Abstract: We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at https://huggingface.co/datasets/Aslan-mingye/OCR-Quality .
[12] Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation
Dawei Dai,Yinxiu Zhou,Chenghang Li,Guolai Jiang,Chengfang Zhang
Main category: cs.CV
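TL;DR: Face-MakeUpV2 is a new text-to-image generation model that targets attribute leakage and physical consistency problems in facial image generation. By building a large-scale dataset and introducing two facial information injection channels together with dedicated optimization objectives, it achieves high consistency of face ID and physical characteristics.
Details
Motivation: Current text-to-image models suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions, so generated facial images lack realism and controllability.
Contribution: 1. the large-scale dataset FaceCaptionMask-1M, with about one million image-text-mask pairs. 2. two facial information injection channels (a 3D facial rendering channel and a global facial feature channel). 3. two optimization objectives: semantic alignment and perceptual loss.
Method: 1. a general text-to-image pretrained model serves as the backbone. 2. a 3D facial rendering channel and a global facial feature channel are added. 3. the model is optimized with semantic alignment and perceptual loss.
Result: Face-MakeUpV2 achieves the best overall performance in preserving face ID and physical consistency, showing its potential for reliable and controllable facial editing.
Insight: Combining a large-scale dataset with two information injection channels is an effective way to address facial consistency in text-to-image generation.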
Abstract: In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In this study, we propose Face-MakeUpV2, a facial image generation model that aims to maintain the consistency of face ID and physical characteristics with the reference image. First, we constructed a large-scale dataset FaceCaptionMask-1M comprising approximately one million image-text-masks pairs that provide precise spatial supervision for the local semantic instructions. Second, we employed a general text-to-image pretrained model as the backbone and introduced two complementary facial information injection channels: a 3D facial rendering channel to incorporate the physical characteristics of the image and a global facial feature channel. Third, we formulated two optimization objectives for the supervised learning of our model: semantic alignment in the model’s embedding space to mitigate the attribute leakage problem and perceptual loss on facial images to preserve ID consistency. Extensive experiments demonstrated that our Face-MakeUpV2 achieves best overall performance in terms of preserving face ID and maintaining physical consistency of the reference images. These results highlight the practical potential of Face-MakeUpV2 for reliable and controllable facial editing in diverse applications.
[13] Ageing Drift in Binary Face Templates: A Bits-per-Decade Analysis
Abdelilah Ganmati,Karim Afdel,Lahcen Koutti
Main category: cs.CV
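TL;DR: The paper studies the long-term stability of compact binary face templates, quantifies ageing drift in bits per decade, and examines experimentally how drift relates to code length and what that means for deployment.
Details
Motivation: To study ageing drift in face templates over long-term use, in particular the stability of binary-coded templates, and to give practical applications such as smart cards and match-on-card a theoretical basis and recommendations for improvement.
Contribution: 1. quantifies the ageing drift of face templates directly in bits per decade; 2. shows that drift grows with code length; 3. offers simple mitigations for unstable bit positions (such as periodic re-enrolment and targeted parity).
Method: Float embeddings from a modern face CNN are compressed into 64-bit and 128-bit PCA-ITQ binary codes. For each identity in AgeDB with samples at multiple ages, a linear model of Hamming distance versus age gap is fitted, and the drift distribution is analyzed across 566 identities.
Result: The median drift is 1.357 bits per decade for 64-bit templates and 2.571 bits per decade for 128-bit templates; the drift distributions are predominantly positive, indicating that intra-class distance increases over time; drift scales with code length, so shorter codes are more stable.
Insight: 1. shorter codes are more age-stable; 2. drift is pervasive and should be mitigated by re-enrolment or by protecting unstable bit positions; 3. the results support deployment strategies for low-storage settings such as smart cards.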
Abstract: We study the longitudinal stability of compact binary face templates and quantify ageing drift directly in bits per decade. Float embeddings from a modern face CNN are compressed with PCA-ITQ into 64- and 128-bit codes. For each identity in AgeDB with at least three distinct ages, we form all genuine pairs and fit a per-identity linear model of Hamming distance versus absolute age gap. Across 566 identities, the median slope is 1.357 bits per decade for 64-bit templates and 2.571 bits per decade for 128-bit templates, with tight non-parametric 95 percent bootstrap confidence intervals. The distributions are predominantly positive, indicating a small but systematic increase in intra-class distance over time. Because drift scales with code length, shorter codes are inherently more age-stable at a fixed decision threshold. We connect these slopes to operating characteristics by reporting EER and TPR at FAR = 1 percent in three age bins. We discuss implications for smart-card and match-on-card deployments, including simple mitigations such as periodic re-enrolment and targeted parity on empirically unstable bit positions. Code and CSV artifacts are provided to support reproducibility.
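Illustrative sketch (assumed details, not the authors' released code): the per-identity drift estimate described in the abstract amounts to a least-squares slope of Hamming distance against absolute age gap over all genuine pairs of one identity, rescaled from bits per year to bits per decade. The function names and the use of `numpy.polyfit` are assumptions for this example.

```python
import numpy as np


def hamming(a, b):
    """Number of differing bits between two equal-length 0/1 arrays."""
    return int(np.count_nonzero(np.asarray(a) != np.asarray(b)))


def drift_bits_per_decade(ages, codes):
    """Least-squares slope of Hamming distance vs. absolute age gap.

    ages:  ages in years for one identity (at least two distinct gaps
           are needed for the line fit to be well defined).
    codes: binary templates (0/1 arrays), one per entry in ages.
    Returns the fitted slope scaled from bits/year to bits/decade.
    """
    gaps, dists = [], []
    for i in range(len(ages)):
        for j in range(i + 1, len(ages)):
            gaps.append(abs(ages[i] - ages[j]))
            dists.append(hamming(codes[i], codes[j]))
    slope, _intercept = np.polyfit(gaps, dists, 1)
    return 10.0 * slope
```

The abstract's requirement of at least three distinct ages per identity ensures that each identity contributes more than one age gap to its fit.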
[14] Bridging Accuracy and Interpretability: Deep Learning with XAI for Breast Cancer Detection
Bishal Chhetri,B. V. Rathish Kumar
Main category: cs.CV
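TL;DR: The study proposes an interpretable deep learning framework for early breast cancer detection that combines high accuracy with explainable AI techniques, markedly improving classification performance and strengthening clinician trust.
Details
Motivation: Conventional deep learning models perform well at breast cancer detection, but their black-box nature hinders clinical use, so explainable AI techniques are needed to improve transparency and acceptance.
Contribution: Proposes a high-performing, interpretable deep learning framework that uses SHAP and LIME, markedly improving classification accuracy and giving intuitive explanations for clinical decisions.
Method: A deep neural network with ReLU activations, the Adam optimizer, and binary cross-entropy loss, combined with SHAP and LIME to produce feature-level attributions and visualizations.
Result: The model reaches an accuracy of 0.992 and an F1 score of 0.988 on breast cancer detection, outperforming several traditional algorithms, and the explanation techniques identify the key feature (concave points of the cell nuclei).
Insight: The concave points of the cell nuclei are the most influential feature for the classification task, a finding that can help improve breast cancer diagnosis and treatment.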
Abstract: In this study, we present an interpretable deep learning framework for the early detection of breast cancer using quantitative features extracted from digitized fine needle aspirate (FNA) images of breast masses. Our deep neural network, using ReLU activations, the Adam optimizer, and a binary cross-entropy loss, delivers state-of-the-art classification performance, achieving an accuracy of 0.992, precision of 1.000, recall of 0.977, and an F1 score of 0.988. These results substantially exceed the benchmarks reported in the literature. We evaluated the model under identical protocols against a suite of well-established algorithms (logistic regression, decision trees, random forests, stochastic gradient descent, K-nearest neighbors, and XGBoost) and found the deep model consistently superior on the same metrics. Recognizing that high predictive accuracy alone is insufficient for clinical adoption due to the black-box nature of deep learning models, we incorporated model-agnostic Explainable AI techniques such as SHAP and LIME to produce feature-level attributions and human-readable visualizations. These explanations quantify the contribution of each feature to individual predictions, support error analysis, and increase clinician trust, thus bridging the gap between performance and interpretability for real-world clinical use. The concave points feature of the cell nuclei is found to be the most influential feature positively impacting the classification task. This insight can be very helpful in improving the diagnosis and treatment of breast cancer by highlighting the key characteristics of breast tumor.
[15] EdgeSync: Accelerating Edge-Model Updates for Data Drift through Adaptive Continuous Learning
Runchu Donga,Peng Zhao,Guiqin Wang,Nan Qi,Jie Lin
Main category: cs.CV
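TL;DR: EdgeSync is an efficient edge-model updating method that uses adaptive continuous learning and dynamic training management to counter the accuracy loss caused by data drift on edge devices, markedly improving the timeliness and accuracy of updates.
Details
Motivation: In real-time video analytics, the distribution of data features can change over time (for example with lighting and weather), degrading the accuracy of lightweight models on edge devices. Existing methods suffer from compute delays during model updates and from mismatch with the new data distribution.
Contribution: EdgeSync improves the timeliness and accuracy of model updates through better sample filtering (combining timeliness and inference results) and a dynamic training management module.
Method: EdgeSync proposes an adaptive continuous learning approach that combines dynamically managed sample filtering with optimized update timing, so that training samples better match the current video content and delays are reduced.
Result: On complex real-world datasets, EdgeSync improves accuracy by about 3.4% over existing methods and by about 10% over traditional approaches.
Insight: Through dynamic management and optimized update timing, EdgeSync balances compute load against model freshness and offers an efficient answer to data drift in edge computing.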
Abstract: Real-time video analytics systems typically deploy lightweight models on edge devices to reduce latency. However, the distribution of data features may change over time due to various factors such as changing lighting and weather conditions, leading to decreased model accuracy. Recent frameworks try to address this issue by leveraging remote servers to continuously train and adapt lightweight edge models using more complex models in the cloud. Despite these advancements, existing methods face two key challenges: first, the retraining process is compute-intensive, causing significant delays in model updates; second, the new model may not align well with the evolving data distribution of the current video stream. To address these challenges, we introduce EdgeSync, an efficient edge-model updating approach that enhances sample filtering by incorporating timeliness and inference results, thus ensuring training samples are more relevant to the current video content while reducing update delays. Additionally, EdgeSync features a dynamic training management module that optimizes the timing and sequencing of model updates to improve their timeliness. Evaluations on diverse and complex real-world datasets demonstrate that EdgeSync improves accuracy by approximately 3.4% compared to existing methods and by about 10% compared to traditional approaches.
[16] Promptable Fire Segmentation: Unleashing SAM2’s Potential for Real-Time Mobile Deployment with Strategic Bounding Box Guidance
Emmanuel U. Ugwu,Zhang Xinming
Main category: cs.CV
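TL;DR: The paper presents the first comprehensive evaluation of SAM2 variants for fire segmentation, focusing on how bounding box prompting improves the feasibility of mobile deployment. Bounding box prompts perform best, especially the Box+MP hybrid with multiple positive points, while lightweight variants (such as TinySAM and MobileSAM) are better suited to low-resource edge scenarios.
Details
Motivation: Fire segmentation is a challenging computer vision task because flames have irregular boundaries, translucent edges, and highly variable intensity. SAM and SAM2 show strong cross-domain generalization, but their performance on fire segmentation, particularly under mobile deployment, remains largely unexplored.
Contribution: 1. the first comprehensive evaluation of SAM2 variants for fire segmentation; 2. a comparison of several prompting strategies (such as bounding boxes and multiple points) for improving segmentation; 3. evidence that lightweight variants (TinySAM and MobileSAM) are practical for edge computing; 4. benchmarks for domain-specific SAM applications.
Method: The paper systematically evaluates four SAM2.1 variants (tiny, small, base_plus, large) and two mobile-oriented variants (TinySAM, MobileSAM), and compares multiple prompting strategies (automatic, single point, multiple points, bounding box, and their combinations).
Result: Bounding box prompting performs best, with Box+MP achieving the highest mean IoU (0.64) and Dice coefficient (0.75) on the Khan dataset. Lightweight variants substantially reduce compute and memory costs and are better suited to edge deployment.
Insight: 1. bounding box prompts are more effective for fire segmentation; 2. lightweight models are practical where compute is constrained; 3. multiple point prompts can further improve segmentation accuracy.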
Abstract: Fire segmentation remains a critical challenge in computer vision due to flames’ irregular boundaries, translucent edges, and highly variable intensities. While the Segment Anything Models (SAM and SAM2) have demonstrated impressive cross-domain generalization capabilities, their effectiveness in fire segmentation – particularly under mobile deployment constraints – remains largely unexplored. This paper presents the first comprehensive evaluation of SAM2 variants for fire segmentation, focusing on bounding box prompting strategies to enhance deployment feasibility. We systematically evaluate four SAM2.1 variants (tiny, small, base_plus, large) alongside mobile-oriented variants (TinySAM, MobileSAM) across three fire datasets using multiple prompting strategies: automatic, single positive point (SP), single positive point + single negative point (SP+SN), multiple positive points (MP), bounding box (Box), and hybrid variants (Box+SP and Box+MP). Our experimental results demonstrate that bounding box prompts consistently outperform automatic and single point-based approaches, with Box+MP achieving the highest mean IoU (0.64) and Dice coefficient (0.75) on the Khan dataset. Lightweight variants such as TinySAM and MobileSAM further reduce memory and computational costs, making them more suitable for latency-tolerant edge scenarios. Overall, this work provides critical insights for deploying promptable segmentation models in fire monitoring systems and establishes benchmarks for future research in domain-specific SAM applications. Code is available at: https://github.com/UEmmanuel5/ProFSAM
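Illustrative sketch: the two metrics reported above, IoU and the Dice coefficient, can be computed for a pair of binary masks as follows. This is the generic textbook definition, not the evaluation code from the ProFSAM repository; the function name and the convention for two empty masks are assumptions.

```python
import numpy as np


def iou_and_dice(pred, target):
    """IoU and Dice coefficient for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Assumed convention: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)
```

For a single pair of masks, Dice = 2·IoU / (1 + IoU); dataset-level means such as the 0.64 and 0.75 quoted above need not satisfy this identity exactly, because the averaging is done per metric.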
[17] Multi-Agent Pose Uncertainty: A Differentiable Rendering Cramér-Rao Bound
Arun Muthukkumar
Main category: cs.CV
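TL;DR: The paper uses a differentiable renderer to derive a closed-form lower bound on the covariance of camera pose estimates, obtained by linearizing image formation, and extends it to multi-agent settings.
Details
Motivation: Although pose estimation is widely used in computer vision and robotics, few works provide rigorous uncertainty quantification for poses under dense or learned models. The paper aims to fill this gap.
Contribution: 1. a closed-form lower bound on the covariance of camera pose estimates; 2. treating a differentiable renderer as the measurement function, which extends existing theory; 3. uncertainty quantification in multi-agent settings by fusing Fisher information.
Method: Image formation is linearized with respect to a small pose perturbation on the manifold, yielding a Cramér-Rao lower bound that incorporates the differentiable renderer.
Result: The method is consistent with classical bundle-adjustment uncertainty quantification and shows potential for tasks such as multi-agent cooperative perception.
Insight: The work offers a new theoretical framework for pose uncertainty, with broad applicability to dense models and multi-agent systems.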
Abstract: Pose estimation is essential for many applications within computer vision and robotics. Despite its uses, few works provide rigorous uncertainty quantification for poses under dense or learned models. We derive a closed-form lower bound on the covariance of camera pose estimates by treating a differentiable renderer as a measurement function. Linearizing image formation with respect to a small pose perturbation on the manifold yields a render-aware Cramér-Rao bound. Our approach reduces to classical bundle-adjustment uncertainty, ensuring continuity with vision theory. It also naturally extends to multi-agent settings by fusing Fisher information across cameras. Our statistical formulation has downstream applications for tasks such as cooperative perception and novel view synthesis without requiring explicit keypoint correspondences.
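Illustrative sketch of the bound's general form: the abstract does not state the noise model, so the following assumes additive Gaussian pixel noise and an unbiased estimator, which gives the standard Cramér-Rao expression. The symbols ($h$ for the renderer, $\xi$ for the pose, $\delta$ for the perturbation on the manifold, $\Sigma$ for the noise covariance, $K$ for the number of cameras) are notation chosen for this example and may differ from the paper's.

```latex
% Assumed measurement model: y = h(\xi) + \varepsilon, \varepsilon \sim \mathcal{N}(0, \Sigma)
J = \left.\frac{\partial h(\xi \oplus \delta)}{\partial \delta}\right|_{\delta = 0},
\qquad
\mathcal{I}(\xi) = J^{\top} \Sigma^{-1} J,
\qquad
\operatorname{Cov}(\hat{\delta}) \succeq \mathcal{I}(\xi)^{-1}

% Fusion across K cameras with independent noise: Fisher information adds
\mathcal{I}_{\mathrm{fused}}(\xi) = \sum_{k=1}^{K} J_k^{\top} \Sigma_k^{-1} J_k
```

Because Fisher information from independent measurements is additive, each extra camera can only tighten the bound or leave it unchanged, which is one way to read the multi-agent extension mentioned in the abstract.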