Unveiling Quality-Factor Metrics to Optimize Data Collection
A Comprehensive Framework for Arabic Sentiment Analysis
DOI:
https://doi.org/10.19139/soic-2310-5070-2467
Abstract
This study introduces a quality-factor-driven approach to dataset evaluation built on four attributes: Class Distribution Index (CDI), Topic Distribution Index (TDI), Average Inverse Document Frequency (IDF), and dataset size. It systematically analyzes how these factors influence model performance in Natural Language Processing (NLP), specifically in Arabic sentiment analysis. The findings show that CDI and TDI have substantial effects. CDI correlates positively with accuracy (0.5568) but negatively with F1-score (-0.7808), indicating that class imbalance can help a model reach higher accuracy while degrading its F1-score, that is, the balance between precision and recall. TDI also negatively affects accuracy and F1-score (-0.2242 and 0.2031), underscoring the challenges of uneven text distribution across datasets. In contrast, Average IDF and dataset size correlate positively with model performance: Average IDF correlates with accuracy at 0.2670 and with F1-score at 0.3207, highlighting the predictive power of rare terms within the dataset, and dataset size correlates with F1-score at 0.3540, reaffirming that larger datasets support better sentiment classification. The study provides foundational insights into the effects of dataset quality on Arabic sentiment analysis, offering strategic directions for future research in underrepresented languages and advancing the understanding of data quality’s implications in NLP.
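The abstract does not give the formulas behind its quality factors, so the following is only a minimal sketch of the kind of computation it describes: an average IDF for a corpus, and a Pearson correlation between a quality factor and a performance metric measured across several datasets. The unsmoothed definition idf(t) = ln(N / df(t)), the whitespace tokenization, and the function names `average_idf` and `pearson` are all assumptions made for illustration, not the authors' implementation. CDI and TDI are omitted because their definitions are not stated here.

```python
import math
from collections import Counter


def average_idf(documents):
    """Mean IDF over the vocabulary, assuming idf(t) = ln(N / df(t)).

    Assumption: documents are plain strings tokenized on whitespace;
    the paper may use a different tokenizer or a smoothed IDF variant.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.split()))
    return sum(math.log(n_docs / df) for df in doc_freq.values()) / len(doc_freq)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy corpus, not data from the study.
    corpus = ["good service", "bad service", "good food"]
    print(f"average IDF: {average_idf(corpus):.4f}")

    # Hypothetical per-dataset values: one quality factor and one metric.
    factor_values = [0.2, 0.4, 0.6, 0.8]
    f1_scores = [0.70, 0.74, 0.73, 0.80]
    print(f"Pearson r: {pearson(factor_values, f1_scores):.4f}")
```

In this reading, each dataset contributes one (factor, metric) pair, and a coefficient near +1 or -1 indicates a strong linear association between the quality factor and model performance.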
Published
2026-06-28
How to Cite
Banou, Z., Elfilali, S., Benlahmar, E. H., & Alaoui, F.-Z. (2026). Unveiling Quality-Factor Metrics to Optimize Data Collection: A Comprehensive Framework for Arabic Sentiment Analysis. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-2467
Section
Research Articles
License
Copyright (c) 2026 Zouheir Banou, Sanaa Elfilali, El Habib Benlahmar, Fatima-Zahra Alaoui

This work is licensed under a Creative Commons Attribution 4.0 International License.