Automated Noise Detection in a Database Based on a Combined Method

Mahdieh Ataeyan; Negin Daneshpour

doi:10.19139/soic-2310-5070-879

Mahdieh Ataeyan, Shahid Rajaee Teacher Training University
Negin Daneshpour, Shahid Rajaee Teacher Training University

Keywords: Data Cleaning, Automated Noise Detection, Clustering, K-means, Data Quality.

Abstract

Data quality has several dimensions, of which accuracy is the most important. Data cleaning is a preprocessing step in data mining that consists of detecting errors and repairing them. Noise is a common type of error that occurs in databases. This paper proposes an automated noise detection method based on k-means clustering. First, each attribute (Aj) is temporarily removed from the data and k-means clustering is applied to the remaining attributes. Then, the k-nearest neighbors algorithm is applied within each cluster, and a value for Aj is predicted for each record from its nearest neighbors. The proposed method detects noisy attributes using the predicted values. It can identify several noisy values in a single record and can detect noise in fields of different data types. Experiments show that the method detects, on average, 92% of the noise present in the data. The proposed method is compared with a noise detection method based on association rules; the results indicate that it improves noise detection by 13% on average.
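The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; several details are assumptions made only for illustration. It handles numeric attributes only, uses a plain Lloyd's k-means with a fixed seed, predicts Aj as the median of the nearest neighbors' values, and flags a cell when its relative deviation from the prediction exceeds a hypothetical threshold `tol`. The names `kmeans`, `detect_noise`, `n_neighbors`, and `tol` are likewise hypothetical; the abstract does not specify the prediction or decision rule.

```python
import math
import random
import statistics


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def detect_noise(records, k=2, n_neighbors=3, tol=0.5):
    """Return the set of (row, column) cells flagged as noisy."""
    flagged = set()
    for j in range(len(records[0])):
        # Temporarily remove attribute j and cluster on the other attributes.
        rest = [[v for i, v in enumerate(r) if i != j] for r in records]
        labels = kmeans(rest, k)
        for row, (features, label) in enumerate(zip(rest, labels)):
            # Nearest neighbors are searched only inside the record's cluster.
            peers = [i for i, lab in enumerate(labels) if lab == label and i != row]
            peers.sort(key=lambda i: math.dist(features, rest[i]))
            nearest = peers[:n_neighbors]
            if not nearest:
                continue  # a singleton cluster gives no basis for a prediction
            # Assumed predictor: median of the neighbors' values for attribute j.
            predicted = statistics.median(records[i][j] for i in nearest)
            # Assumed decision rule: relative deviation larger than tol.
            if abs(records[row][j] - predicted) > tol * max(abs(predicted), 1e-9):
                flagged.add((row, j))
    return flagged


# Toy data: two well-separated groups; row 3 carries one corrupted value (3.0).
records = [
    [1.0, 1.0, 1.0], [1.1, 1.0, 1.1], [0.9, 1.1, 1.0], [1.0, 0.9, 3.0],
    [10.0, 10.0, 10.0], [10.1, 10.0, 10.1], [9.9, 10.1, 10.0], [10.0, 9.9, 9.9],
]
noisy = detect_noise(records)
print(noisy)
```

On this toy data the sketch flags only the cell in row 3, column 2. The median is used instead of the mean so that a single corrupted neighbor does not distort the predictions for the clean records around it. Because every attribute is tested in turn, the same loop can flag more than one cell in a single record.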
