VIDEO SUMMARIZATION AND SCENE ANALYSIS

Authors

  • R Uma, B Aparna, J Sreeja, B Architha, B Nikhil

DOI:

https://doi.org/10.62643/

Abstract

The exponential growth of online video content has made automated video summarization and scene analysis a pressing research problem at the intersection of computer vision and natural language processing. This project presents a multimodal generative AI pipeline for automated video summarization that combines visual understanding with natural language generation. The system integrates three pretrained deep learning models: OpenAI CLIP (Contrastive Language–Image Pretraining) for semantically driven keyframe selection, OpenAI Whisper for automatic speech recognition and audio transcription, and Facebook BART-Large-CNN for abstractive text summarization. A YouTube video is downloaded with yt-dlp and its frames are extracted with OpenCV. CLIP then scores each frame by cosine similarity against a semantic text prompt, and the highest-scoring frames are kept as keyframes. In parallel, Whisper transcribes the audio track and BART generates a concise abstractive summary from the transcript. Evaluated on an educational YouTube lecture video, the system correctly identifies contextually relevant keyframes and produces a coherent two-sentence summary of the video content. The pipeline is fully implemented in Python on Google Colab, demonstrating feasibility for deployment in e-learning, surveillance, and media archiving contexts.
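
The keyframe selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the frame and prompt embeddings have already been produced by CLIP's image and text encoders, and it substitutes toy three-dimensional vectors for them. The function names select_keyframes and cosine_similarity and the parameter top_k are hypothetical names chosen for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def select_keyframes(frame_embeddings, prompt_embedding, top_k=3):
    """Return (frame_index, score) for the top_k frames most similar to the prompt.

    Results are returned in temporal (index) order rather than score order.
    """
    scores = [cosine_similarity(e, prompt_embedding) for e in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_k])
    return [(i, scores[i]) for i in chosen]


# Toy stand-ins for CLIP embeddings (assumption: real embeddings are
# much higher-dimensional and come from the pretrained encoders).
frames = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
prompt = [1.0, 0.0, 0.0]

keyframes = select_keyframes(frames, prompt, top_k=2)
for index, score in keyframes:
    print(f"frame {index}: similarity {score:.3f}")
```

In a full pipeline the same ranking would be applied to embeddings of frames sampled by OpenCV, with the prompt embedding computed once and reused for every frame.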

Published

12-06-2026

How to Cite

VIDEO SUMMARIZATION AND SCENE ANALYSIS. (2026). International Journal of Engineering Research and Science & Technology, 22(2(1)), 2336-2343. https://doi.org/10.62643/