A Comprehensive Evaluation of ViLT and CLIP Transformers for Multimodal Fake News Detection with Metadata Integration
DOI:
https://doi.org/10.19139/soic-2310-5070-4227
Keywords:
Multimodal Fake News Detection, CLIP, ViLT, Cross-Attention Fusion, Fakeddit, FakeNewsNet, Hardware Constrained Machine Learning
Abstract
The rapid expansion of social media has fueled the spread of multimodal misinformation, in which textual information, visual material, and context combine to influence perception. In this research, we compare two vision-language transformer-based systems, ViLT and CLIP, on the task of detecting multimodal fake news using two benchmark datasets, Fakeddit and FakeNewsNet. The evaluations cover binary and three-class classification, each with and without metadata. Late Fusion and Gated Fusion approaches are also considered in order to explore whether the two transformers complement each other. The results indicate that the effectiveness of metadata integration depends on the architecture and the dataset. ViLT benefits from metadata integration because its unified transformer architecture lets textual, visual, and contextual features interact inside the attention mechanism. In contrast, CLIP performs better at semantic image-text alignment, especially when no metadata is available. ViLT achieves an accuracy of 0.906 on the FakeNewsNet dataset and 0.831 on the binary Fakeddit task, while CLIP obtains competitive accuracy on the three-class Fakeddit task without metadata. However, the improvements introduced by metadata were small in several cases, which suggests that metadata should not be assumed to be universally effective.
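The abstract names Late Fusion and Gated Fusion without giving their formulation, so the following is only a minimal sketch of the two general techniques, not the authors' implementation. It assumes late fusion is a weighted average of each model's class probabilities and gated fusion is a per-dimension sigmoid gate over two equal-length feature vectors. All function names, parameters, and the toy inputs are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(logits_a: Sequence[float],
                logits_b: Sequence[float],
                weight_a: float = 0.5) -> List[float]:
    """Late fusion (assumed form): each model predicts independently and
    only the resulting class probabilities are mixed."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]


def gated_fusion(h_a: Sequence[float],
                 h_b: Sequence[float],
                 gate_weights: Sequence[Sequence[float]],
                 gate_bias: Sequence[float]) -> List[float]:
    """Gated fusion (assumed form): a sigmoid gate computed from the
    concatenated features decides, per dimension, how much of h_a versus
    h_b is kept. gate_weights has one row per output dimension and one
    column per entry of the concatenation [h_a; h_b]."""
    concat = list(h_a) + list(h_b)
    fused = []
    for i, (a, b) in enumerate(zip(h_a, h_b)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        g = 1.0 / (1.0 + math.exp(-z))  # gate value in (0, 1)
        fused.append(g * a + (1.0 - g) * b)
    return fused


# Hypothetical stand-ins for the outputs of two models (e.g. ViLT and CLIP).
probs = late_fusion([2.0, 0.0], [0.0, 2.0])
zero_w = [[0.0] * 4 for _ in range(2)]
fused = gated_fusion([1.0, 3.0], [3.0, 1.0], zero_w, [0.0, 0.0])
```

In a trained system the gate weights and bias would be learned together with the classifier. With all-zero gate parameters, as in the toy call above, the gate is 0.5 everywhere and gated fusion reduces to a plain average of the two feature vectors.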
Published
2026-07-01
How to Cite
Abd Alhamid, A., Fakhr, M. W., Ashry, M. M., & Abd Elhafeez, A. (2026). A Comprehensive Evaluation of ViLT and CLIP Transformers for Multimodal Fake News Detection with Metadata Integration. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-4227
License
Copyright (c) 2026 Ahmed Abd Alhamid, Mohamed Waleed Fakhr, Mahmoud M Ashry, Ahmed Abd Elhafeez

This work is licensed under a Creative Commons Attribution 4.0 International License.