Feature Selection Based on Divergence Functions: A Comparative Classification Study

  • Saeid Pourmand, Shahid Beheshti University
  • Ashkan Shabbak, Statistical Research and Training Center (SRTC)
  • Mojtaba Ganjali, Shahid Beheshti University
Keywords: Dimensionality Reduction, Machine Learning, Feature Selection, Filters, Wrappers, Embedded Methods, Divergence Functions.


Because high-dimensional data are now used across a wide range of scientific fields, dimensionality reduction has become a major part of the preprocessing step in machine learning. Feature selection (FS) is one procedure for reducing dimensionality: instead of using the whole set of features, a subset is selected for use in the learning model. FS methods are divided into three main categories: filters, wrappers, and embedded approaches. Filter methods depend only on the characteristics of the data and do not rely on the learning model at hand. Divergence functions, which measure the difference between probability distributions, can be used as filter methods for feature selection. In this paper, the performance of several divergence functions, such as the Jensen-Shannon (JS) divergence and the Exponential (EXP) divergence, is compared with that of some of the best-known filter feature selection methods, such as Information Gain (IG) and Chi-Squared (CHI). The comparison is made through the accuracy rate and F1-score of classification models trained after applying these feature selection methods.
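
To make the idea of a divergence-based filter concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes discrete features and exactly two classes, and it scores each feature by the Jensen-Shannon divergence (base-2 logarithms) between its two class-conditional distributions. The function names `js_divergence`, `class_conditional`, and `js_feature_scores` and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two discrete
    distributions given as equal-length lists of probabilities."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a probability of zero contribute 0 by convention.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def class_conditional(column, labels, target, values):
    """Empirical distribution of one feature over `values`,
    restricted to the samples whose label equals `target`."""
    counts = Counter(v for v, c in zip(column, labels) if c == target)
    total = sum(counts.values())
    return [counts[v] / total for v in values]


def js_feature_scores(rows, labels):
    """Score each feature by the JS divergence between its two
    class-conditional distributions (assumption: binary labels)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        values = sorted(set(column))
        p = class_conditional(column, labels, classes[0], values)
        q = class_conditional(column, labels, classes[1], values)
        scores.append(js_divergence(p, q))
    return scores


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 1, 1]

scores = js_feature_scores(rows, labels)
# Rank features from most to least discriminative.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Because the score is computed from the data alone, with no classifier in the loop, this is a filter in the sense described above; a subset would then be chosen by keeping the top-ranked features.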


How to Cite
Pourmand, S., Shabbak, A., & Ganjali, M. (2021). Feature Selection Based on Divergence Functions: A Comparative Classification Study. Statistics, Optimization & Information Computing, 9(3), 587-606. https://doi.org/10.19139/soic-2310-5070-1092
Research Articles