A Unified Multimodal Deep Learning System for Robust Deep-fake Analysis
DOI: https://doi.org/10.62643/ijerst.2026.v22.n1(2).pp219-223

Keywords: Deepfake Detection; Convolutional Neural Networks; GoogLeNet; Log-Mel Spectrogram; LSTM; Grad-CAM; Multimodal Fusion; Streamlit

Abstract
Deepfake media generated by generative adversarial networks (GANs) and neural voice synthesis pose an escalating threat to information integrity. This paper presents a multimodal deepfake detection system that jointly analyses video frames, facial images, and audio waveforms using three specialized deep learning branches: a GoogLeNet-based visual feature extractor coupled with an LSTM for temporal video modelling, a GoogLeNet image classifier with Grad-CAM explainability, and a four-layer convolutional neural network (CNN) trained on log-mel spectrograms for audio deepfake detection. A weighted fusion layer combines per-modality softmax scores into a single fakeness probability with a configurable decision threshold. The audio CNN achieves 94.2% test accuracy on binary real/fake classification; multimodal fusion further improves detection to 96.3% accuracy with an F1-score of 0.963. An interactive Streamlit application delivers real-time predictions with Grad-CAM attention overlays for analyst interpretability.
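The weighted fusion step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the branch weights and the 0.5 decision threshold are assumed placeholder values, and `fuse_scores` is a hypothetical helper name.

```python
import numpy as np

def fuse_scores(p_video, p_image, p_audio,
                weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Weighted fusion of per-modality 'fake' softmax scores.

    Each p_* is one branch's softmax probability for the 'fake' class.
    The weights and threshold are illustrative assumptions, not the
    values reported in the paper.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise weights so the fused score stays in [0, 1]
    scores = np.array([p_video, p_image, p_audio], dtype=float)
    fakeness = float(w @ scores)  # single fakeness probability
    return fakeness, fakeness >= threshold

# Example: video branch 0.9, image branch 0.8, audio branch 0.7
prob, is_fake = fuse_scores(0.9, 0.8, 0.7)
```

With equal-ish weights a strong signal in any one branch raises the fused score, while the configurable threshold lets an analyst trade false positives against misses without retraining any branch.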
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.