VIDEO SUMMARIZATION AND SCENE ANALYSIS
DOI: https://doi.org/10.62643/
Abstract
The exponential growth of online video content has made automated video summarization and scene analysis a pressing research problem at the intersection of computer vision and natural language processing. This project presents a multimodal generative AI pipeline for automated video summarization that combines visual understanding with natural language generation. The system integrates three pretrained deep learning models: OpenAI CLIP (Contrastive Language–Image Pretraining) for semantically driven keyframe selection, OpenAI Whisper for automatic speech recognition, and Facebook BART-Large-CNN for abstractive text summarization. A YouTube video is downloaded with yt-dlp, frames are extracted with OpenCV, and each frame is scored by the cosine similarity between its CLIP embedding and the embedding of a semantic text prompt; the highest-scoring frames are kept as keyframes. Whisper transcribes the audio track, and BART generates a concise abstractive summary from the transcript. Evaluated on an educational YouTube lecture video, the system correctly identifies contextually relevant keyframes and produces a coherent two-sentence summary of the video content. The pipeline is implemented entirely in Python on Google Colab, demonstrating feasibility for deployment in e-learning, surveillance, and media archiving contexts.
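The keyframe selection step described in the abstract can be illustrated with a minimal sketch. It assumes the frame and prompt embeddings have already been computed (in the pipeline they would come from CLIP's image and text encoders) and shows only the ranking by cosine similarity. The function names, the default value of k, and the toy two-dimensional vectors are hypothetical and do not come from the project's code.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def top_k_keyframes(
    frame_embeddings: Sequence[Sequence[float]],
    prompt_embedding: Sequence[float],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Rank frames by similarity to the prompt and return the k best.

    Each result is a (frame_index, similarity) pair, highest similarity first.
    """
    scored = [
        (index, cosine_similarity(embedding, prompt_embedding))
        for index, embedding in enumerate(frame_embeddings)
    ]
    # sorted() is stable, so frames with equal scores keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for real CLIP embeddings (hypothetical values).
toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_prompt = [1.0, 0.0]
best = top_k_keyframes(toy_frames, toy_prompt, k=2)
```

In a full pipeline the indices in `best` would be mapped back to the extracted frames; real CLIP embeddings have several hundred dimensions, but the ranking logic is the same.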
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.













