VIDEO SUMMARIZATION AND SCENE ANALYSIS

1 R Uma, 2 B Aparna, 3 J Sreeja, 4 B Architha, 5 B Nikhil

doi:10.62643/

Abstract

The exponential growth of online video content has made automated video summarization and scene analysis a pressing research problem at the intersection of computer vision and natural language processing. This project presents a multimodal generative AI pipeline for automated video summarization that combines visual understanding with natural language generation. The system integrates three pretrained deep learning models: OpenAI CLIP (Contrastive Language–Image Pretraining) for semantically driven keyframe selection, OpenAI Whisper for automatic speech recognition, and Facebook BART-Large-CNN for abstractive text summarization. A YouTube video is downloaded with yt-dlp and its frames are extracted with OpenCV. Each frame is then scored by the cosine similarity between its CLIP embedding and the embedding of a semantic text prompt, and the top-scoring frames are kept as keyframes. Whisper transcribes the audio track, and BART generates a concise abstractive summary from the transcript. Evaluated on one educational YouTube lecture video, the system correctly identifies contextually relevant keyframes and produces a coherent two-sentence summary of the video content. The pipeline is implemented entirely in Python on Google Colab, demonstrating its feasibility for e-learning, surveillance, and media archiving applications.
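The keyframe selection step described above can be illustrated with a minimal sketch. It covers only the ranking by cosine similarity and is not the authors' implementation: the function names `cosine_similarity` and `select_keyframes`, the `top_k` parameter, and the three-dimensional toy vectors are all assumptions made for illustration. In the actual pipeline the vectors would be CLIP image and text embeddings, which are much higher-dimensional.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def select_keyframes(
    frame_embeddings: Sequence[Sequence[float]],
    prompt_embedding: Sequence[float],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Rank frames by similarity to the prompt and keep the best top_k.

    Returns (frame_index, score) pairs, highest score first.
    """
    scored = [
        (index, cosine_similarity(embedding, prompt_embedding))
        for index, embedding in enumerate(frame_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for CLIP embeddings (hypothetical values, not real model output).
toy_frames = [
    [1.0, 0.0, 0.0],  # frame 0: points the same way as the prompt
    [0.9, 0.1, 0.0],  # frame 1: close to the prompt
    [0.0, 1.0, 0.0],  # frame 2: orthogonal to the prompt
    [0.5, 0.5, 0.0],  # frame 3: partially related
]
toy_prompt = [1.0, 0.0, 0.0]

top_frames = select_keyframes(toy_frames, toy_prompt, top_k=2)
for index, score in top_frames:
    print(f"frame {index}: similarity {score:.3f}")
```

With these toy vectors the two frames pointing closest to the prompt direction, frames 0 and 1, are selected. Cosine similarity ignores vector magnitude, so the ranking depends only on direction in the embedding space.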
