Proposed Two-Steps Procedure of Classification High Dimensional Data with Regularized Logistic Regression

Omar Alshebly; Suhail N. Abdullah

doi:10.19139/soic-2310-5070-1846

Omar Alshebly, University of Mosul, College of Computer Sciences and Mathematics, https://orcid.org/0000-0002-4126-3259
Suhail N. Abdullah


Abstract

The field of bioinformatics has developed in response to the rapid growth of biological data, particularly high-dimensional gene expression data. Bioinformatics draws on optimization, computational science, and statistical methods to address challenges in molecular biology. Many of the genes (variables) measured in a gene expression study are irrelevant to the outcome of interest, and gene selection has been shown to improve the performance of many classification methods. Obtaining the significant variables with ranking variable selection (RVS) techniques and then choosing the most effective classifier is a major challenge for high-dimensional data. In this study, we propose a new ranking filter method, based on the smoothly clipped absolute deviation (SCAD) penalty combined with a resampling technique (RSVS), to obtain a compact subset of genes with strong classification ability. This is achieved by combining a screening technique, used as a filter, with regularized logistic regression methods such as LASSO, ALASSO, ENET, and MCP. The proposed approach was evaluated empirically on both simulated and real data sets. The findings indicate that the proposed method outperformed other established methods. It was tested on three publicly available cancer data sets. The results show that the proposed approach is effective and viable, performing strongly in terms of accuracy, geometric mean, and area under the curve. Furthermore, the findings suggest that the most frequently selected genes are biologically associated with the specific type of cancer. The proposed method therefore has potential advantages for cancer classification from DNA gene expression data in a clinical setting.
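
The two-step idea in the abstract (rank and filter genes under resampling, then fit a regularized logistic regression on the retained genes) can be illustrated with a minimal sketch. It is not the authors' RSVS procedure. As a labeled assumption, the SCAD-based ranking score is replaced here by an absolute Welch-type t statistic averaged over bootstrap resamples, and the second step uses a LASSO penalty fitted by proximal gradient descent; ALASSO, ENET, or MCP could take its place. The function names (`rank_genes`, `fit_l1_logistic`, `two_step_fit`, `make_toy_data`), the synthetic data, and all tuning values are hypothetical.

```python
import math
import random


def rank_genes(X, y, n_resamples=30, seed=0):
    """Step 1 (filter): order genes by their average rank over bootstrap resamples.

    Assumption: the per-resample score is an absolute Welch-type t statistic,
    a simple stand-in for the SCAD-based score described in the abstract.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    rank_sum = [0.0] * p
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        groups = [[X[i] for i in idx if y[i] == c] for c in (0, 1)]
        if min(len(g) for g in groups) < 2:
            continue  # degenerate resample: one class nearly absent
        scores = []
        for j in range(p):
            stats = []
            for g in groups:
                col = [row[j] for row in g]
                m = sum(col) / len(col)
                var = sum((v - m) ** 2 for v in col) / (len(col) - 1)
                stats.append((m, var / len(col)))
            (m0, se0), (m1, se1) = stats
            scores.append(abs(m1 - m0) / math.sqrt(se0 + se1 + 1e-12))
        for rank, j in enumerate(sorted(range(p), key=lambda g: -scores[g])):
            rank_sum[j] += rank
    return sorted(range(p), key=lambda g: rank_sum[g])  # most relevant first


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(X, y, lam=0.05, step=0.1, n_iter=300):
    """Step 2: LASSO-penalized logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * row[j] / n
        b -= step * grad_b  # the intercept is not penalized
        # Gradient step followed by soft-thresholding (the L1 proximal operator).
        w = [math.copysign(max(abs(u) - step * lam, 0.0), u)
             for u in (wj - step * gj for wj, gj in zip(w, grad_w))]
    return w, b


def two_step_fit(X, y, k=5):
    """Screen down to the k top-ranked genes, then fit the penalized model."""
    selected = rank_genes(X, y)[:k]
    X_sub = [[row[j] for j in selected] for row in X]
    w, b = fit_l1_logistic(X_sub, y)
    return selected, w, b


def make_toy_data(n=60, p=200, n_informative=3, shift=1.5, seed=1):
    """Synthetic expression matrix: only the first n_informative genes carry signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * (2 * label - 1) if j < n_informative else 0.0, 1.0)
          for j in range(p)] for label in y]
    return X, y


X, y = make_toy_data()
selected, w, b = two_step_fit(X, y)
preds = [int(sigmoid(b + sum(wj * row[j] for wj, j in zip(w, selected))) > 0.5)
         for row in X]
accuracy = sum(int(pr == t) for pr, t in zip(preds, y)) / len(y)
print("selected genes:", selected)
print("training accuracy:", accuracy)
```

The accuracy printed here is a training-set figure on toy data and only confirms that the pipeline runs. A real evaluation would hold out test data, tune `lam` and `k` by cross-validation, and also report geometric mean and area under the curve, as the abstract does.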

