Full Content-based Web Page Classification Methods by using Deep Neural Networks

Suleyman Suleymanzade; Fargana Abdullayeva

doi:10.19139/soic-2310-5070-1056

Suleyman Suleymanzade, Institute of Information Technology of ANAS
Fargana Abdullayeva, Institute of Information Technology of ANAS

Keywords: web page classification, LSTM, web crawler, deep learning, data aggregation

Abstract

The quality of web page classification has a major impact on information retrieval systems. In this paper, we propose combining the results of text and image classifiers to obtain an accurate representation of web pages. To collect and analyse the data, we built a composite classifier system consisting of a data miner, a text classifier, and an aggregator. Image and text classification is performed by deep learning models. To form a unified view of a web page, we propose three aggregation techniques that combine the outputs of the classifiers.
