A Unified Multimodal Deep Learning System for Robust Deep-fake Analysis
DOI: https://doi.org/10.62643/ijerst.2026.v22.n1(2).pp219-223

Keywords: Deepfake Detection; Convolutional Neural Networks; GoogLeNet; Log-Mel Spectrogram; LSTM; Grad-CAM; Multimodal Fusion; Streamlit

Abstract
Deepfake media generated by generative adversarial networks (GANs) and neural voice synthesis pose an escalating threat to information integrity. This paper presents a multimodal deepfake detection system that jointly analyses video frames, facial images, and audio waveforms using three specialized deep learning branches: a GoogLeNet-based visual feature extractor coupled with an LSTM for temporal video modelling, a GoogLeNet image classifier with Grad-CAM explainability, and a four-layer convolutional neural network (CNN) trained on log-mel spectrograms for audio deepfake detection. A weighted fusion layer combines per-modality softmax scores into a single fakeness probability with a configurable decision threshold. The audio CNN achieves 94.2% test accuracy on binary real/fake classification; multimodal fusion further improves detection to 96.3% accuracy with an F1-score of 0.963. An interactive Streamlit application delivers real-time predictions with Grad-CAM attention overlays for analyst interpretability.
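The weighted fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the branch weights and the 0.5 decision threshold are assumed placeholder values, and `fuse_scores` is a hypothetical helper name.

```python
import numpy as np

def fuse_scores(p_video, p_image, p_audio,
                weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Weighted fusion of per-modality 'fake' softmax scores.

    Each p_* is one branch's softmax probability for the 'fake' class.
    The weights and threshold are illustrative assumptions, not the
    values reported in the paper.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise weights so the fused score stays in [0, 1]
    scores = np.array([p_video, p_image, p_audio], dtype=float)
    fakeness = float(w @ scores)  # single fakeness probability
    return fakeness, fakeness >= threshold

# Example: video branch 0.9, image branch 0.8, audio branch 0.7
prob, is_fake = fuse_scores(0.9, 0.8, 0.7)
```

With equal-ish weights a strong signal in any one branch raises the fused score, while the configurable threshold lets an analyst trade false positives against misses without retraining any branch.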
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.