Unveiling Quality-Factor Metrics to Optimize Data Collection

A Comprehensive Framework for Arabic Sentiment Analysis

Authors

  • Zouheir Banou FSBM
  • Sanaa Elfilali
  • El Habib Benlahmar
  • Fatima-Zahra Alaoui

DOI:

https://doi.org/10.19139/soic-2310-5070-2467

Abstract

Our study introduces a novel quality-factor-driven approach to dataset evaluation, focusing on four key attributes: Class Distribution Index (CDI), Topic Distribution Index (TDI), Average Inverse Document Frequency (Average IDF), and dataset size. By systematically analyzing these factors, this research assesses their influence on model performance in Natural Language Processing (NLP), particularly in Arabic sentiment analysis. The findings reveal that CDI and TDI exhibit substantial impacts. CDI correlates positively with accuracy (0.5568) but strongly negatively with F1-score (-0.7808), indicating that while class distribution imbalance might help the model achieve higher accuracy, it adversely impacts F1-score, reducing the balance between precision and recall. TDI is also reported to negatively affect accuracy and F1-score (-0.2242 and 0.2031, respectively), underscoring the challenges of uneven text distribution across datasets. In contrast, Average IDF and dataset size correlate positively with model performance: Average IDF shows correlations of 0.2670 with accuracy and 0.3207 with F1-score, highlighting the predictive power of rare terms within the dataset. Dataset size also correlates positively with F1-score (0.3540), reaffirming that larger datasets support improved sentiment classification. This study provides foundational insights into the effects of dataset quality on Arabic sentiment analysis, offering strategic directions for future research in underrepresented languages and advancing our understanding of data quality’s implications in NLP.
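
As a rough illustration of the kind of analysis the abstract describes, the sketch below computes an average IDF for a small corpus and a correlation coefficient between a quality factor and a performance metric. It rests on assumptions not stated in the abstract: IDF is taken as ln(N / df) over whitespace tokens, and the correlation is taken to be Pearson's r. The function names `average_idf` and `pearson` and the toy corpus are hypothetical, and the paper's own definitions may differ.

```python
import math
from collections import Counter


def average_idf(documents):
    """Mean IDF over the vocabulary, assuming idf(t) = ln(N / df(t)).

    Tokenization is plain whitespace splitting (an assumption; real Arabic
    text would normally need normalization and a proper tokenizer).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document to get document frequency.
        doc_freq.update(set(doc.split()))
    idfs = [math.log(n_docs / df) for df in doc_freq.values()]
    return sum(idfs) / len(idfs)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    corpus = ["فيلم رائع", "فيلم سيء", "خدمة رائعة"]
    print("Average IDF:", average_idf(corpus))
```

In such a setup, each dataset would contribute one value per quality factor and one per metric, and `pearson` would be applied across datasets to obtain coefficients like those quoted in the abstract.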

Published

2026-06-28

How to Cite

Banou, Z., Elfilali, S., Benlahmar, E. H., & Alaoui, F.-Z. (2026). Unveiling Quality-Factor Metrics to Optimize Data Collection: A Comprehensive Framework for Arabic Sentiment Analysis. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-2467

Section

Research Articles