cs.CV

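[1] Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models cs.CV | cs.AI | cs.CL

Alexa R. Tartaglini, Satchel Grant, Daniel Wurgaft, Christopher Potts, Judith E. Fan

TL;DR: The paper develops FUGU, a task suite for diagnosing bottlenecks in how vision-language models (VLMs) understand data visualizations. It finds that errors stem mainly from the handoff of information between the vision and language modules, and that current model architectures impose limits.

Details

Motivation: Data visualizations are an important part of scientific articles and news stories, yet current VLMs perform poorly on basic visualization tasks. The study aims to pinpoint where errors arise: in visual encoding, in the transfer of information between modules, or in language-side processing.

Result: Some VLMs fail to generate the coordinates of individual data points correctly, and supplying the correct coordinates improves performance substantially on tasks involving one or a few points. On tasks that require statistical relationships across many data points, supplying coordinates generally makes performance worse. Fine-tuning on FUGU does not reach ceiling performance.

Insight: The architecture of current VLMs limits how reliably they can understand data visualizations, with bottlenecks in the vision-language handoff and in handling relationships across many data points.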

Abstract: Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used FUGU to investigate three widely used VLMs. To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through models across a variety of prompting strategies. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially. Moreover, even when the model generates an incorrect response, the correct coordinates can be successfully read out from the latent representations in the vision encoder, suggesting that the source of these errors lies in the vision-language handoff. We further found that while providing correct coordinates helps with tasks involving one or a small number of data points, it generally worsens performance for tasks that require extracting statistical relationships across many data points. Fine-tuning models on FUGU also fails to yield ceiling performance. These findings point to architectural constraints in current VLMs that might pose significant challenges for reliable data visualization understanding.
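The linear-probe idea mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the activations are synthetic stand-ins rather than real vision-encoder states, and the helper names `fit_linear_probe` and `probe_predict` are hypothetical, not taken from the FUGU code.

```python
import numpy as np

def fit_linear_probe(hidden, targets, ridge=1e-3):
    """Fit a ridge-regularized linear map from activations to (x, y) coordinates."""
    # Append a constant column so the probe can learn an offset.
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    # Closed-form ridge solution: W = (X^T X + ridge * I)^{-1} X^T Y.
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ targets)

def probe_predict(weights, hidden):
    """Apply a fitted probe to new activations."""
    X = np.hstack([hidden, np.ones((hidden.shape[0], 1))])
    return X @ weights

# Synthetic stand-in (assumption): coordinates are linearly embedded in
# 16-dimensional "activations" with a little noise.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))
mixing = rng.normal(size=(2, 16))
hidden = coords @ mixing + 0.01 * rng.normal(size=(200, 16))

# Fit on 150 examples, evaluate read-out error on the held-out 50.
weights = fit_linear_probe(hidden[:150], coords[:150])
pred = probe_predict(weights, hidden[150:])
probe_mae = float(np.mean(np.abs(pred - coords[150:])))
print(f"held-out mean absolute error: {probe_mae:.4f}")
```

A low held-out error means the coordinates are linearly decodable from the representation, which is the kind of evidence the abstract uses to argue that the information is present in the vision encoder even when the final answer is wrong.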


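[2] Agro-Consensus: Semantic Self-Consistency in Vision-Language Models for Crop Disease Management in Developing Countries cs.CV

Mihir Gupta, Pratik Desai, Ross Greer

TL;DR: The paper proposes Agro-Consensus, a low-cost self-consistency framework that uses semantic clustering and a consensus mechanism to make vision-language models (VLMs) more reliable at agricultural image captioning, aimed at crop disease management in developing countries.

Details

Motivation: Crop disease management in developing countries is held back by scarce expert resources, unreliable connectivity, and high costs, all of which limit the deployment of existing AI systems. The paper aims to design a low-cost, reliable AI framework that improves the accuracy of agricultural diagnosis.

Result: On 800 crop disease images: 1) single-cluster consensus reaches 83.1% accuracy with 10 candidate captions, against a 77.5% greedy-decoding baseline; 2) multi-cluster consensus over the top four clusters reaches 94.0%, against 88.5% for a top-4 selection from the baseline.

Insight: Semantic self-consistency can substantially improve VLM reliability in resource-constrained settings, and the human-in-the-loop (HITL) design further limits error propagation, offering a practical approach for agricultural AI in developing countries.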

Abstract: Agricultural disease management in developing countries such as India, Kenya, and Nigeria faces significant challenges due to limited access to expert plant pathologists, unreliable internet connectivity, and cost constraints that hinder the deployment of large-scale AI systems. This work introduces a cost-effective self-consistency framework to improve vision-language model (VLM) reliability for agricultural image captioning. The proposed method employs semantic clustering, using a lightweight (80MB) pre-trained embedding model to group multiple candidate responses. It then selects the most coherent caption – containing a diagnosis, symptoms, analysis, treatment, and prevention recommendations – through a cosine similarity-based consensus. A practical human-in-the-loop (HITL) component is incorporated, wherein user confirmation of the crop type filters erroneous generations, ensuring higher-quality input for the consensus mechanism. Applied to the publicly available PlantVillage dataset using a fine-tuned 3B-parameter PaliGemma model, our framework demonstrates improvements over standard decoding methods. Evaluated on 800 crop disease images with up to 21 generations per image, our single-cluster consensus method achieves a peak accuracy of 83.1% with 10 candidate generations, compared to the 77.5% baseline accuracy of greedy decoding. The framework’s effectiveness is further demonstrated when considering multiple clusters; accuracy rises to 94.0% when a correct response is found within any of the top four candidate clusters, outperforming the 88.5% achieved by a top-4 selection from the baseline.
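A minimal sketch of cosine-similarity consensus over candidate captions follows. The abstract does not specify the clustering algorithm, so the greedy threshold clustering, the `threshold` value, and the toy 2-D embeddings are all assumptions for illustration. In practice the embeddings would come from the lightweight embedding model the abstract mentions.

```python
import numpy as np

def cosine_matrix(emb):
    """Pairwise cosine similarity between row vectors."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def consensus_index(emb, threshold=0.8):
    """Return the index of the consensus candidate and the clusters found."""
    sim = cosine_matrix(emb)
    clusters = []  # each cluster is a list of candidate indices
    for i in range(len(emb)):
        for members in clusters:
            # Greedy rule (assumption): join the first cluster whose seed is close enough.
            if sim[i, members[0]] >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    largest = max(clusters, key=len)
    # Pick the member most similar, on average, to the rest of its cluster.
    mean_sim = sim[np.ix_(largest, largest)].mean(axis=1)
    return largest[int(np.argmax(mean_sim))], clusters

# Toy embeddings (assumption): three near-duplicate captions and one outlier.
emb = np.array([
    [1.00, 0.00],
    [0.90, 0.10],
    [0.95, 0.05],
    [0.00, 1.00],
])
best, clusters = consensus_index(emb)
print(best, clusters)
```

The outlier lands in its own cluster and is ignored, while the selected caption is the one closest to the centre of the majority cluster.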


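[3] Proportion and Perspective Control for Flow-Based Image Generation cs.CV | cs.AI

Julien Boudier, Hugo Caselles-Dupré

TL;DR: The paper introduces two ControlNet modules, a proportion ControlNet and a perspective ControlNet, which use bounding boxes and vanishing lines respectively to control the spatial and geometric structure of generated images, giving finer control over text-to-image models.

Details

Motivation: Modern text-to-image diffusion models generate high-fidelity images but offer limited control over spatial and geometric structure. More precise tools are needed for artistic work.

Result: Experiments show that both modules control generation effectively but remain limited under complex constraints.

Insight: Bounding boxes and geometric constraints can markedly improve the controllability of image generation, though complex scenes still need further work.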

Abstract: While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: https://huggingface.co/obvious-research


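[4] H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows cs.CV

Harry Zhang, Luca Carlone

TL;DR: H2OFlow is a framework built on 3D generative models and dense diffused flows that learns 3D affordances for human-object interaction (HOI), covering contact, orientation, and spatial occupancy, without human annotation.

Details

Motivation: Existing methods depend on costly hand-labeled datasets and overlook orientation and spatial occupancy in interactions. H2OFlow aims to address both problems with synthetic data.

Result: Experiments show that H2OFlow generalizes well to real-world objects and outperforms methods that rely on manual annotation or mesh-based representations.

Insight: Synthetic data and diffused flows can be an effective alternative for learning complex 3D interaction affordances, avoiding the data-annotation bottleneck.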

Abstract: Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (e.g., humans might have a preferential orientation with respect to certain objects, such as a TV) and spatial occupancy (e.g., humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce H2OFlow, a novel framework that comprehensively learns 3D HOI affordances (contact, orientation, and spatial occupancy) using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance.


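[5] OCR-Quality: A Human-Annotated Dataset for OCR Quality Assessment cs.CV | cs.AI

Yulong Zhang

TL;DR: The paper presents OCR-Quality, a human-annotated dataset for evaluating and developing OCR quality assessment methods. It contains 1,000 PDF pages converted to PNG images, covers a range of real-world scenarios, and provides 4-level quality scores.

Details

Motivation: Reliable datasets for assessing OCR quality in real-world settings are lacking, which limits the development and improvement of OCR verification systems.

Result: OCR-Quality provides a benchmark for OCR quality assessment, is publicly available, and supports the training and evaluation of OCR verification systems.

Insight: The dataset is designed around diversity and practical needs, underscores the importance of human annotation, and lays groundwork for further OCR research.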

Abstract: We present OCR-Quality, a comprehensive human-annotated dataset designed for evaluating and developing OCR quality assessment methods. The dataset consists of 1,000 PDF pages converted to PNG images at 300 DPI, sampled from diverse real-world scenarios, including academic papers, textbooks, e-books, and multilingual documents. Each document has been processed using state-of-the-art Vision-Language Models (VLMs) and manually annotated with quality scores using a 4-level scoring system (1: Excellent, 2: Good, 3: Fair, 4: Poor). The dataset includes detailed source information, annotation guidelines, and representative cases across various difficulty levels. OCR-Quality addresses the critical need for reliable OCR quality assessment in real-world applications and provides a valuable benchmark for training and evaluating OCR verification systems. The dataset is publicly available at https://huggingface.co/datasets/Aslan-mingye/OCR-Quality .


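[6] Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation cs.CV | cs.AI | eess.IV

Dawei Dai, Yinxiu Zhou, Chenghang Li, Guolai Jiang, Chengfang Zhang

TL;DR: Face-MakeUpV2 is a text-to-image generation model that targets attribute leakage and physical inconsistency in facial image generation. By building a large-scale dataset, introducing two facial information injection channels, and adding dedicated optimization objectives, it keeps face ID and physical characteristics highly consistent with the reference image.

Details

Motivation: Current text-to-image models suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions, so the generated faces lack realism and controllability.

Result: Face-MakeUpV2 achieves the best overall performance in preserving face ID and physical consistency, showing its potential for reliable and controllable facial editing.

Insight: Combining a large-scale dataset with two complementary information injection channels is an effective way to address facial consistency in text-to-image generation.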

Abstract: In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In this study, we propose Face-MakeUpV2, a facial image generation model that aims to maintain the consistency of face ID and physical characteristics with the reference image. First, we constructed a large-scale dataset, FaceCaptionMask-1M, comprising approximately one million image-text-mask pairs that provide precise spatial supervision for the local semantic instructions. Second, we employed a general text-to-image pretrained model as the backbone and introduced two complementary facial information injection channels: a 3D facial rendering channel to incorporate the physical characteristics of the image and a global facial feature channel. Third, we formulated two optimization objectives for the supervised learning of our model: semantic alignment in the model's embedding space to mitigate the attribute leakage problem and perceptual loss on facial images to preserve ID consistency. Extensive experiments demonstrated that Face-MakeUpV2 achieves the best overall performance in terms of preserving face ID and maintaining physical consistency with the reference images. These results highlight the practical potential of Face-MakeUpV2 for reliable and controllable facial editing in diverse applications.


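[7] Ageing Drift in Binary Face Templates: A Bits-per-Decade Analysis cs.CV | 68T45, 68T10, 62H35

Abdelilah Ganmati, Karim Afdel, Lahcen Koutti

TL;DR: The paper studies the long-term stability of compact binary face templates, quantifies ageing drift in bits per decade, and examines experimentally how code length relates to drift and what that means for real deployments.

Details

Motivation: To study ageing drift in face templates over long-term use, especially the stability of binary-coded templates, and to inform practical applications such as smart cards and match-on-card.

Result: The median drift is 1.357 bits per decade for 64-bit templates and 2.571 bits per decade for 128-bit templates. The drift distributions are predominantly positive, meaning intra-class distance grows over time. Drift scales with code length, so shorter codes are more stable.

Insight: 1. Shorter codes are more age-stable. 2. Drift is systematic and can be mitigated by periodic re-enrolment or targeted parity on unstable bit positions. 3. The results inform deployment strategies for low-storage settings such as smart cards.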

Abstract: We study the longitudinal stability of compact binary face templates and quantify ageing drift directly in bits per decade. Float embeddings from a modern face CNN are compressed with PCA-ITQ into 64- and 128-bit codes. For each identity in AgeDB with at least three distinct ages, we form all genuine pairs and fit a per-identity linear model of Hamming distance versus absolute age gap. Across 566 identities, the median slope is 1.357 bits per decade for 64-bit templates and 2.571 bits per decade for 128-bit templates, with tight non-parametric 95 percent bootstrap confidence intervals. The distributions are predominantly positive, indicating a small but systematic increase in intra-class distance over time. Because drift scales with code length, shorter codes are inherently more age-stable at a fixed decision threshold. We connect these slopes to operating characteristics by reporting EER and TPR at FAR = 1 percent in three age bins. We discuss implications for smart-card and match-on-card deployments, including simple mitigations such as periodic re-enrolment and targeted parity on empirically unstable bit positions. Code and CSV artifacts are provided to support reproducibility.
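The per-identity fit described in the abstract (Hamming distance against absolute age gap, slope reported in bits per decade) can be sketched as follows. This is an illustrative reading of the procedure, not the authors' released code: the function names are hypothetical and the 8-bit toy templates are made up.

```python
from itertools import combinations

import numpy as np

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

def drift_bits_per_decade(codes, ages):
    """Slope of Hamming distance vs. absolute age gap for one identity."""
    gaps, dists = [], []
    # Form all genuine pairs for this identity.
    for i, j in combinations(range(len(ages)), 2):
        gaps.append(abs(ages[i] - ages[j]))
        dists.append(hamming(codes[i], codes[j]))
    # Least-squares line; the slope is in bits per year.
    slope_per_year, _intercept = np.polyfit(gaps, dists, deg=1)
    return 10.0 * slope_per_year

# Toy identity A (assumption): one extra bit flips every ten years.
codes_a = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],   # age 20
    [1, 0, 0, 0, 0, 0, 0, 0],   # age 30
    [1, 1, 0, 0, 0, 0, 0, 0],   # age 40
])
# Toy identity B (assumption): a perfectly stable template.
codes_b = np.zeros((3, 8), dtype=int)

slope_a = drift_bits_per_decade(codes_a, [20, 30, 40])
slope_b = drift_bits_per_decade(codes_b, [25, 35, 55])
median_drift = float(np.median([slope_a, slope_b]))
print(slope_a, slope_b, median_drift)
```

Reporting the median of the per-identity slopes, as the abstract does, keeps a few identities with unusual drift from dominating the summary.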


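[8] Bridging Accuracy and Interpretability: Deep Learning with XAI for Breast Cancer Detection cs.CV | cs.AI

Bishal Chhetri, B. V. Rathish Kumar

TL;DR: The study proposes an interpretable deep learning framework for early breast cancer detection that combines high accuracy with explainable AI techniques, improving classification performance and supporting clinician trust.

Details

Motivation: Deep learning models perform well at breast cancer detection, but their black-box nature hinders clinical adoption, so explainable AI techniques are needed to improve transparency and acceptance.

Result: The model reaches an accuracy of 0.992 and an F1 score of 0.988, outperforming several traditional algorithms, and the explainability techniques identify a key feature: the concave points of the cell nuclei.

Insight: The concave points of the cell nuclei are the most influential feature for the classification task, a finding that may help improve breast cancer diagnosis and treatment.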

Abstract: In this study, we present an interpretable deep learning framework for the early detection of breast cancer using quantitative features extracted from digitized fine needle aspirate (FNA) images of breast masses. Our deep neural network, using ReLU activations, the Adam optimizer, and a binary cross-entropy loss, delivers state-of-the-art classification performance, achieving an accuracy of 0.992, precision of 1.000, recall of 0.977, and an F1 score of 0.988. These results substantially exceed the benchmarks reported in the literature. We evaluated the model under identical protocols against a suite of well-established algorithms (logistic regression, decision trees, random forests, stochastic gradient descent, K-nearest neighbors, and XGBoost) and found the deep model consistently superior on the same metrics. Recognizing that high predictive accuracy alone is insufficient for clinical adoption due to the black-box nature of deep learning models, we incorporated model-agnostic Explainable AI techniques such as SHAP and LIME to produce feature-level attributions and human-readable visualizations. These explanations quantify the contribution of each feature to individual predictions, support error analysis, and increase clinician trust, thus bridging the gap between performance and interpretability for real-world clinical use. The concave points feature of the cell nuclei is found to be the most influential feature positively impacting the classification task. This insight can be very helpful in improving the diagnosis and treatment of breast cancer by highlighting the key characteristics of breast tumors.
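As a quick consistency check on the reported metrics, the F1 score can be recomputed from the stated precision and recall. The helper name `f1_score` below is our own, and the snippet uses only the numbers quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values quoted in the abstract.
f1 = f1_score(precision=1.000, recall=0.977)
print(round(f1, 3))
```

The recomputed value rounds to the reported 0.988, so the three figures are mutually consistent.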


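[9] EdgeSync: Accelerating Edge-Model Updates for Data Drift through Adaptive Continuous Learning cs.CV | cs.AI

Runchu Donga, Peng Zhao, Guiqin Wang, Nan Qi, Jie Lin

TL;DR: EdgeSync is an efficient edge-model updating method that uses adaptive continuous learning and dynamic training management to counter the accuracy loss that data drift causes on edge devices, improving both the timeliness and the accuracy of updates.

Details

Motivation: In real-time video analytics, the distribution of data features can change over time (for example with lighting and weather), which degrades the accuracy of lightweight models on edge devices. Existing update methods suffer from compute-induced delays and from updated models that do not match the new data distribution.

Result: On complex real-world datasets, EdgeSync improves accuracy by about 3.4% over existing methods and by about 10% over traditional approaches.

Insight: By dynamically managing and optimizing the timing of updates, EdgeSync balances compute load against model freshness and offers an efficient answer to data drift in edge computing.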

Abstract: Real-time video analytics systems typically deploy lightweight models on edge devices to reduce latency. However, the distribution of data features may change over time due to various factors such as changing lighting and weather conditions, leading to decreased model accuracy. Recent frameworks try to address this issue by leveraging remote servers to continuously train and adapt lightweight edge models using more complex models in the cloud. Despite these advancements, existing methods face two key challenges: first, the retraining process is compute-intensive, causing significant delays in model updates; second, the new model may not align well with the evolving data distribution of the current video stream. To address these challenges, we introduce EdgeSync, an efficient edge-model updating approach that enhances sample filtering by incorporating timeliness and inference results, thus ensuring training samples are more relevant to the current video content while reducing update delays. Additionally, EdgeSync features a dynamic training management module that optimizes the timing and sequencing of model updates to improve their timeliness. Evaluations on diverse and complex real-world datasets demonstrate that EdgeSync improves accuracy by approximately 3.4% compared to existing methods and by about 10% compared to traditional approaches.


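[10] Promptable Fire Segmentation: Unleashing SAM2’s Potential for Real-Time Mobile Deployment with Strategic Bounding Box Guidance cs.CV

Emmanuel U. Ugwu, Zhang Xinming

TL;DR: The paper presents the first comprehensive evaluation of SAM2 variants on fire segmentation, focusing on how bounding box prompting improves the feasibility of mobile deployment. Bounding box prompts perform best, especially Box+MP, which adds multiple positive points, while lightweight variants such as TinySAM and MobileSAM are better suited to low-resource edge scenarios.

Details

Motivation: Fire segmentation is a hard computer vision task because flames have irregular boundaries, translucent edges, and highly variable intensity. SAM and SAM2 generalize impressively across domains, but their performance on fire segmentation, particularly under mobile deployment constraints, has not been fully explored.

Result: Bounding box prompting performs best, with Box+MP achieving the highest mean IoU (0.64) and Dice coefficient (0.75) on the Khan dataset. Lightweight variants substantially reduce compute and memory overhead and are better suited to edge deployment.

Insight: 1. Bounding box prompts are more effective for fire segmentation. 2. Lightweight models are practical where compute is constrained. 3. Multiple point prompts can further improve segmentation accuracy.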

Abstract: Fire segmentation remains a critical challenge in computer vision due to flames’ irregular boundaries, translucent edges, and highly variable intensities. While the Segment Anything Models (SAM and SAM2) have demonstrated impressive cross-domain generalization capabilities, their effectiveness in fire segmentation – particularly under mobile deployment constraints – remains largely unexplored. This paper presents the first comprehensive evaluation of SAM2 variants for fire segmentation, focusing on bounding box prompting strategies to enhance deployment feasibility. We systematically evaluate four SAM2.1 variants (tiny, small, base_plus, large) alongside mobile-oriented variants (TinySAM, MobileSAM) across three fire datasets using multiple prompting strategies: automatic, single positive point (SP), single positive point + single negative point (SP+SN), multiple positive points (MP), bounding box (Box), and hybrid variants (Box+SP and Box+MP). Our experimental results demonstrate that bounding box prompts consistently outperform automatic and single point-based approaches, with Box+MP achieving the highest mean IoU (0.64) and Dice coefficient (0.75) on the Khan dataset. Lightweight variants such as TinySAM and MobileSAM further reduce memory and computational costs, making them more suitable for latency-tolerant edge scenarios. Overall, this work provides critical insights for deploying promptable segmentation models in fire monitoring systems and establishes benchmarks for future research in domain-specific SAM applications. Code is available at: https://github.com/UEmmanuel5/ProFSAM
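For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of IoU and the Dice coefficient on binary masks. The toy 4x4 masks are made up for illustration and the function name `iou_and_dice` is our own, not taken from the ProFSAM repository.

```python
import numpy as np

def iou_and_dice(pred, target):
    """Intersection-over-union and Dice coefficient for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention (assumption): two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)

# Toy masks: two 2x2 blocks that overlap in one column.
target = np.zeros((4, 4), dtype=int)
target[0:2, 0:2] = 1
pred = np.zeros((4, 4), dtype=int)
pred[0:2, 1:3] = 1

iou, dice = iou_and_dice(pred, target)
print(iou, dice)
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU). That identity does not carry over to dataset-level means, which is why a mean IoU of 0.64 and a mean Dice of 0.75 can be reported together.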


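[11] Multi-Agent Pose Uncertainty: A Differentiable Rendering Cramér-Rao Bound cs.CV | cs.GR | cs.LG | cs.RO

Arun Muthukkumar

TL;DR: The paper proposes a differentiable-renderer-based method that derives a closed-form lower bound on the covariance of camera pose estimates by linearizing image formation, and extends it to multi-agent settings.

Details

Motivation: Pose estimation is widely used in computer vision and robotics, yet few works provide rigorous uncertainty quantification for poses under dense or learned models. The paper aims to fill this gap.

Result: The method reduces to classical bundle-adjustment uncertainty and shows potential for tasks such as multi-agent cooperative perception.

Insight: The work offers a new theoretical framework for pose uncertainty, with broad potential in dense models and multi-agent systems.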

Abstract: Pose estimation is essential for many applications within computer vision and robotics. Despite its uses, few works provide rigorous uncertainty quantification for poses under dense or learned models. We derive a closed-form lower bound on the covariance of camera pose estimates by treating a differentiable renderer as a measurement function. Linearizing image formation with respect to a small pose perturbation on the manifold yields a render-aware Cramér-Rao bound. Our approach reduces to classical bundle-adjustment uncertainty, ensuring continuity with vision theory. It also naturally extends to multi-agent settings by fusing Fisher information across cameras. Our statistical formulation has downstream applications for tasks such as cooperative perception and novel view synthesis without requiring explicit keypoint correspondences.
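The fusion step can be sketched numerically under simplifying assumptions that the abstract does not state: independent Gaussian pixel noise with a known standard deviation, and cameras that all observe the same 6-DoF pose perturbation. The random Jacobians below are stand-ins for what a differentiable renderer would return, and the function names are hypothetical.

```python
import numpy as np

def fisher_information(jacobian, sigma):
    """Fisher information J^T J / sigma^2 for i.i.d. Gaussian pixel noise (assumption)."""
    return jacobian.T @ jacobian / sigma**2

def cramer_rao_bound(fisher):
    """Lower bound on the covariance of any unbiased pose estimator."""
    return np.linalg.inv(fisher)

# Stand-in Jacobians (assumption): 500 pixels by 6 pose parameters per camera.
rng = np.random.default_rng(0)
J_a = rng.normal(size=(500, 6))
J_b = rng.normal(size=(500, 6))

F_a = fisher_information(J_a, sigma=0.1)
F_b = fisher_information(J_b, sigma=0.1)

bound_single = cramer_rao_bound(F_a)
# With independent measurements, Fisher information adds across cameras.
bound_fused = cramer_rao_bound(F_a + F_b)
print(np.diag(bound_single))
print(np.diag(bound_fused))
```

Because the second camera's information matrix is positive definite here, every diagonal entry of the fused bound is smaller than the single-camera one, which is the sense in which adding agents tightens the pose uncertainty.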


cs.CL

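[12] Policy Optimization Prefers The Path of Least Resistance cs.CL

Debdeep Sanyal, Aakash Sen Sharma, Dhruv Kumar, Saurabh Deshpande, Murari Mandal

TL;DR: The paper studies how policy optimization (PO) behaves on multi-step reasoning tasks and finds that PO gravitates to the simplest path (answering directly), even when the more complex path (thinking before answering) carries a larger reward weight.

Details

Motivation: Existing pipelines force language models into a strict think-then-answer format to elicit chain-of-thought (CoT) reasoning, but how PO behaves when this constraint is relaxed into an open-ended structure has not been studied enough.

Result: PO optimizes the simplest reward component first, even when faced with mutually exclusive choices or strong incentives for more complex behavior. Converging on the shortcut also requires the KL-regularized policy to have enough freedom to move away from its initial prior.

Insight: Granting a policy freedom is a double-edged sword: it helps discover high-reward shortcuts but also invites gaming of the reward function (reward hacking), which poses a key challenge for alignment.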

Abstract: Policy optimization (PO) algorithms are used to refine Large Language Models for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a strict think-then-answer format to elicit chain-of-thought (CoT); however, the behavior of PO when these rigid constraints are relaxed into an open-ended CoT structure remains an under-studied question. We investigate this gap with an extensive suite of controlled experiments and identify a consistent principle: policy optimization consistently follows the path of least resistance. When afforded the flexibility to interleave reasoning and response, policy optimization consistently learns to discard explicit reasoning, causing the policy to degenerate to a direct \texttt{}-only format. This outcome holds true across various models and algorithms. We find that this collapse in format is persistent even when the complex \texttt{} format is assigned up to 4x larger reward weights. We formalize this principle through a series of controlled reward decomposition experiments, demonstrating a clear hierarchy: PO systematically optimizes for the simplest reward component first, a preference that holds even when faced with mutually exclusive choices or strong incentives for more complex behaviors. Finally, we show that successful convergence on the high-reward shortcut is not a low-effort drift but is driven by the optimization process that requires the KL-regularized policy to have sufficient freedom to make a significant shift from its initial prior. Our findings reveal that granting policies the freedom to diverge is a double-edged sword: while necessary for discovering high-reward shortcuts, it also creates a powerful incentive to game the simplest aspects of the reward function, posing a critical challenge for reward hacking under alignment.


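[13] Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks cs.CL | cs.AI

Avinash Patil

TL;DR: The paper proposes RACE, a framework that measures how well explanations generated by large language models (LLMs) align with feature importance scores from logistic regression, revealing both how faithful those explanations are and an asymmetry between correct and incorrect predictions.

Details

Motivation: As machine learning spreads into sensitive domains, demand for transparent and interpretable AI is growing. LLMs can generate natural language explanations, but it is unclear whether these faithfully reflect the signals behind the predictions.

Result: Explanations for correct predictions cover more of the supporting features, while explanations for incorrect predictions are associated with contradicting features. Edit-distance matching additionally uncovers paraphrastic overlaps.

Insight: LLM explanations reuse surface-level evidence but can also amplify misleading cues in error cases. RACE provides a quantitative basis for evaluating reasoning completeness in neural language models.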

Abstract: The growing adoption of machine learning (ML) in sensitive domains has heightened the demand for transparent and interpretable artificial intelligence. Large Language Models (LLMs) are increasingly capable of producing natural language explanations, yet it remains unclear whether these rationales faithfully capture the predictive signals that underlie decisions. This paper introduces RACE (Reasoning Alignment for Completeness of Explanations), a systematic framework to evaluate the alignment between LLM-generated explanations and interpretable feature importance scores derived from a logistic regression baseline. We analyze four widely used text classification datasets (WIKI ONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS) and compare LLM rationales against top-ranked supporting and contradicting lexical features. To capture alignment at multiple levels of granularity, RACE implements token-aware, exact string, and edit-distance matching techniques. Empirical results reveal a consistent asymmetry: correct predictions exhibit higher coverage of supporting features, while incorrect predictions are associated with elevated coverage of contradicting features. Edit-distance matching further uncovers paraphrastic overlaps, boosting coverage while preserving this asymmetry. These findings demonstrate that LLM rationales combine both surface-level and flexible evidence reuse, yet can also amplify misleading cues in error cases. RACE provides new insights into the faithfulness of LLM explanations and establishes a quantitative basis for evaluating reasoning completeness in neural language models.
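The difference between exact and edit-distance matching can be shown with a minimal coverage sketch. The abstract does not give RACE's matching rules, so the whitespace tokenization, the `max_edits` threshold, and the example rationale and feature list are all assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def coverage(rationale, features, max_edits=0):
    """Fraction of features matched by some rationale token within max_edits."""
    tokens = rationale.lower().split()
    hits = [f for f in features
            if any(edit_distance(f, t) <= max_edits for t in tokens)]
    return len(hits) / len(features)

# Hypothetical rationale and top-ranked supporting features.
rationale = "the film was wonderfully acted and beautifuly shot"
supporting = ["wonderful", "beautifully", "moving"]

exact = coverage(rationale, supporting, max_edits=0)
fuzzy = coverage(rationale, supporting, max_edits=2)
print(exact, fuzzy)
```

Exact matching misses both the inflected form and the misspelling, while a small edit budget recovers them, which mirrors the abstract's observation that edit-distance matching raises coverage.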


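[14] Understanding Network Behaviors through Natural Language Question-Answering cs.CL | cs.AI

Mingzhe Xing, Chang Tian, Jianan Zhang, Lichen Pan, Peipei Liu

TL;DR: NetMind is a natural-language framework for understanding network behavior. It uses a tree-based configuration chunking strategy and a unified fact graph to address LLM difficulties with long contexts, device heterogeneity, and complex reasoning, and it outperforms existing methods in experiments.

Details

Motivation: The complexity of modern large-scale networks raises the risk of misconfiguration. Existing methods rely on domain-specific languages and formal models, which have a steep learning curve and limited flexibility. Natural language (NL) offers a more intuitive interface, but LLMs still struggle with long configurations, heterogeneity, and complex reasoning.

Result: Experiments show that NetMind is accurate and scalable on network behavior understanding tasks and outperforms existing baselines.

Insight: Combining a chunking strategy with an intermediate representation (the unified fact graph) is an effective way to strengthen LLMs in complex network scenarios, and the hybrid language design improves precision without sacrificing flexibility.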

Abstract: Modern large-scale networks introduce significant complexity in understanding network behaviors, increasing the risk of misconfiguration. Prior work proposed to understand network behaviors by mining network configurations, typically relying on domain-specific languages interfaced with formal models. While effective, they suffer from a steep learning curve and limited flexibility. In contrast, natural language (NL) offers a more accessible and interpretable interface, motivating recent research on NL-guided network behavior understanding. Recent advances in large language models (LLMs) further enhance this direction, leveraging their extensive prior knowledge of network concepts and strong reasoning capabilities. However, three key challenges remain: 1) numerous router devices with lengthy configuration files challenge LLM’s long-context understanding ability; 2) heterogeneity across devices and protocols impedes scalability; and 3) complex network topologies and protocols demand advanced reasoning abilities beyond the current capabilities of LLMs. To tackle the above challenges, we propose NetMind, a novel framework for querying networks using NL. Our approach introduces a tree-based configuration chunking strategy to preserve semantic coherence while enabling efficient partitioning. We then construct a unified fact graph as an intermediate representation to normalize vendor-specific configurations. Finally, we design a hybrid imperative-declarative language to reduce the reasoning burden on LLMs and enhance precision. We contribute a benchmark consisting of NL question-answer pairs paired with network configurations. Experiments demonstrate that NetMind achieves accurate and scalable network behavior understanding, outperforming existing baselines.


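[15] Deep Literature Survey Automation with an Iterative Workflow cs.CL | cs.AI

Hongbo Zhang, Han Cui, Yidong Wang, Yijian Tian, Qi Guo

TL;DR: The paper proposes IterSurvey, an automated literature survey framework built on an iterative workflow. Through dynamic outline generation and the integration of multimodal elements, it substantially improves survey quality and readability.

Details

Motivation: Existing automated survey systems mostly follow a one-shot paradigm of single-pass retrieval and static outline generation, which leads to noisy retrieval, fragmented structure, and context overload, all of which hurt survey quality. Inspired by the iterative reading process of human researchers, the paper proposes a new framework.

Result: IterSurvey outperforms existing baselines in content coverage, structural coherence, and citation quality, and produces surveys that are easier to read and better organized. The Survey-Arena benchmark provides a more reliable assessment of these improvements.

Insight: The paper shows the importance of dynamic outline generation and multimodal integration for automated literature surveys, and highlights the role of evaluation benchmarks in assessing quality gains.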

Abstract: Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static outline is generated before drafting. This design often leads to noisy retrieval, fragmented structures, and context overload, ultimately limiting survey quality. Inspired by the iterative reading process of human researchers, we propose IterSurvey, a framework based on recurrent outline generation, in which a planning agent incrementally retrieves, reads, and updates the outline to ensure both exploration and coherence. To provide faithful paper-level grounding, we design paper cards that distill each paper into its contributions, methods, and findings, and introduce a review-and-refine loop with visualization enhancement to improve textual flow and integrate multimodal elements such as figures and tables. Experiments on both established and emerging topics show that IterSurvey substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, while producing more accessible and better-organized surveys. To provide a more reliable assessment of such improvements, we further introduce Survey-Arena, a pairwise benchmark that complements absolute scoring and more clearly positions machine-generated surveys relative to human-written ones. The code is available at https://github.com/HancCui/IterSurvey_Autosurveyv2.


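[16] Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics cs.CL

Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag

TL;DR: The paper reveals a systematic length bias in quality estimation (QE) metrics for machine translation and proposes two mitigation strategies.

Details

Motivation: QE metrics are widely used in machine translation for reference-free evaluation and as reward signals in reinforcement learning, but their systematic bias with respect to translation length, and its impact, have not been studied enough.

Result: Experiments show that both strategies, length normalization during training and incorporating reference texts during evaluation, effectively reduce length bias in QE metrics.

Insight: Length bias in QE metrics can unfairly penalize longer, high-quality translations, which in turn affects how well the metrics work for reranking and reinforcement learning.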

Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and as a reward signal in tasks like reinforcement learning. However, the prevalence and impact of length bias in QE have been underexplored. Through a systematic study of top-performing regression-based and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a preference for shorter translations when multiple candidates are available for the same source text. These inherent length biases risk unfairly penalizing longer, correct translations and can lead to sub-optimal decision-making in applications such as QE reranking and QE guided reinforcement learning. To mitigate this, we propose two strategies: (a) applying length normalization during model training, and (b) incorporating reference texts during evaluation. Both approaches were found to effectively reduce the identified length bias.


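[17] ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality cs.CL | cs.LG

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell

TL;DR: Based on 774 multilingual training experiments, the paper proposes the Adaptive Transfer Scaling Law (ATLAS), which clearly outperforms existing scaling laws, and sheds light on multilingual learning dynamics, cross-lingual transfer, and ways to mitigate the curse of multilinguality.

Details

Motivation: Scaling laws research has focused mainly on English and offers little support for multilingual settings, leaving the needs of billions of non-English users unmet.

Result: ATLAS generalizes out of sample better than existing scaling laws, often by more than 0.3 R^2, and the study measures transfer between languages with a cross-lingual transfer matrix.

Insight: Multilingual models can avoid performance loss by scaling model size and data appropriately as languages are added, and the choice between pretraining from scratch and finetuning from a multilingual checkpoint should be based on the compute trade-off.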

Abstract: Scaling laws research has focused overwhelmingly on English – yet the most prominent AI models explicitly serve billions of international users. In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages. We introduce the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual pretraining, which outperforms existing scaling laws’ out-of-sample generalization often by more than 0.3 R^2. Our analyses of the experiments shed light on multilingual learning dynamics, transfer properties between languages, and the curse of multilinguality. First, we derive a cross-lingual transfer matrix, empirically measuring mutual benefit scores between 38 x 38=1444 language pairs. Second, we derive a language-agnostic scaling law that reveals how to optimally scale model size and data when adding languages without sacrificing performance. Third, we identify the computational crossover points for when to pretrain from scratch versus finetune from multilingual checkpoints. We hope these findings provide the scientific foundation for democratizing scaling laws across languages, and enable practitioners to efficiently scale models – beyond English-first AI.