A Comprehensive Evaluation of ViLT and CLIP Transformers for Multimodal Fake News Detection with Metadata Integration

Authors

  • Ahmed Abd Alhamid, College of Computing and Information Technology, Arab Academy for Science, Technology, and Maritime Transport, Cairo, Egypt.
  • Mohamed Waleed Fakhr, Head of Computing School, Coventry University, TKH Branch, Cairo, Egypt.
  • Mahmoud M Ashry, Head of the Scientific Departments, College of Computing and Information Technology (Smart Village), Arab Academy for Science, Technology, and Maritime Transport, Cairo, Egypt.
  • Ahmed Abd Elhafeez, Faculty of Computer and Information Technology, Innovation University, Cairo, Egypt; Applied Science Research Center, Applied Science Private University, Amman, Jordan.

DOI:

https://doi.org/10.19139/soic-2310-5070-4227

Keywords:

Multimodal Fake News Detection, CLIP, ViLT, Cross-Attention Fusion, Fakeddit, FakeNewsNet, Hardware Constrained Machine Learning

Abstract

The rapid expansion of social media has fueled the spread of multimodal misinformation, in which textual information, visual material, and context combine to influence perception. In this research, we compare two vision-language transformer-based systems, ViLT and CLIP, on the task of detecting multimodal fake news using two benchmark datasets, Fakeddit and FakeNewsNet. The evaluations cover both binary and three-class classification, each with and without metadata. Furthermore, Late Fusion and Gated Fusion approaches are considered in order to explore how effectively the transformers complement each other. The results indicate that the effectiveness of metadata integration depends on the architecture and the dataset. ViLT can gain from metadata integration because its unified transformer architecture enables interaction between textual, visual, and contextual features inside the attention mechanism. In contrast, CLIP performs better at semantic image-text alignment, especially when no metadata is available. ViLT achieves an accuracy of 0.906 on the FakeNewsNet dataset and 0.831 on the binary Fakeddit task, while CLIP obtains competitive accuracy on the three-class version of the same task without metadata. However, the improvements introduced by metadata were small in several cases, which suggests that metadata should not be assumed to be universally effective.

Published

2026-07-01

How to Cite

Abd Alhamid, A., Fakhr, M. W., Ashry, M. M., & Abd Elhafeez, A. (2026). A Comprehensive Evaluation of ViLT and CLIP Transformers for Multimodal Fake News Detection with Metadata Integration. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-4227

Section

Research Articles
