A Comprehensive Evaluation of ViLT and CLIP Transformers for Multimodal Fake News Detection with Metadata Integration

Ahmed Abd Alhamid; Mohamed Waleed Fakhr; Mahmoud M Ashry; Ahmed Abd Elhafeez

doi:10.19139/soic-2310-5070-4227

Authors

Ahmed Abd Alhamid, College of Computing and Information Technology, Arab Academy for Science, Technology, and Maritime Transport, Cairo, Egypt.
Mohamed Waleed Fakhr, Head of Computing School, Coventry University, TKH Branch, Cairo, Egypt.
Mahmoud M Ashry, Head of the Scientific Departments, College of Computing and Information Technology (Smart Village), Arab Academy for Science, Technology, and Maritime Transport, Cairo, Egypt.
Ahmed Abd Elhafeez, Faculty of Computer and Information Technology, Innovation University, Cairo, Egypt; Applied Science Research Center, Applied Science Private University, Amman, Jordan.

DOI:

https://doi.org/10.19139/soic-2310-5070-4227

Keywords:

Multimodal Fake News Detection, CLIP, ViLT, Cross-Attention Fusion, Fakeddit, FakeNewsNet, Hardware Constrained Machine Learning

Abstract

The rapid expansion of social media has fueled multimodal misinformation, in which textual information, visual material, and context combine to influence perception. In this research, we compare two vision-language transformer-based systems, ViLT and CLIP, on the task of detecting multimodal fake news using two benchmark datasets, Fakeddit and FakeNewsNet. The evaluations cover binary and three-class classification, each with and without metadata. Furthermore, Late Fusion and Gated Fusion approaches are considered in order to explore how effectively the transformers complement one another. The results indicate that the effectiveness of metadata integration depends on the architecture and the dataset. ViLT benefits from metadata integration because its unified transformer architecture lets textual, visual, and contextual features interact inside the attention mechanism. In contrast, CLIP performs better at semantic image-text alignment, especially when no metadata is available. ViLT achieves an accuracy of 0.906 on the FakeNewsNet dataset and 0.831 on the binary Fakeddit task, while CLIP obtains competitive accuracy on the three-class Fakeddit task without metadata. However, the improvements introduced by metadata were small in several cases, which suggests that metadata should not be assumed to be universally effective.
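
The abstract names Late Fusion and Gated Fusion without stating how they are formulated. As a rough illustration only, the sketch below shows one common textbook form of each: late fusion as a weighted average of two models' class probabilities, and gated fusion as an elementwise sigmoid gate that mixes two feature vectors. The function names, the `weight_a` parameter, and the gate parameterization are assumptions made for this example and may differ from what the paper implements.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(probs_a, probs_b, weight_a=0.5):
    """Late fusion (assumed form): weighted average of two models' class probabilities."""
    return [weight_a * pa + (1.0 - weight_a) * pb
            for pa, pb in zip(probs_a, probs_b)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(feat_a, feat_b, gate_weights, gate_bias):
    """Gated fusion (assumed form): g = sigmoid(W [a; b] + bias), output g*a + (1-g)*b.

    gate_weights has one row per feature dimension; each row spans the
    concatenation of feat_a and feat_b. In practice these would be learned.
    """
    concat = feat_a + feat_b
    fused = []
    for i, (a, b) in enumerate(zip(feat_a, feat_b)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = sigmoid(z)  # g near 1 favors feat_a, g near 0 favors feat_b
        fused.append(g * a + (1.0 - g) * b)
    return fused


# Hypothetical inputs: stand-ins for the two models' outputs and features.
probs_model_a = softmax([2.0, 0.5])
probs_model_b = softmax([0.2, 1.1])
combined = late_fusion(probs_model_a, probs_model_b)

zero_gate = [[0.0] * 4, [0.0] * 4]  # untrained gate: every g equals 0.5
mixed = gated_fusion([1.0, 2.0], [3.0, 4.0], zero_gate, [0.0, 0.0])
```

With all gate weights and biases at zero, every gate value is 0.5 and the gated output reduces to a plain average of the two feature vectors; a trained gate would instead shift weight toward whichever representation is more informative for a given input.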

Published

2026-07-01

How to Cite

Abd Alhamid, A., Fakhr, M. W., Ashry, M. M., & Abd Elhafeez, A. (2026). A Comprehensive Evaluation of ViLT and CLIP Transformers for Multimodal Fake News Detection with Metadata Integration. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-4227

Issue

Online First

Section

Research Articles

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
