Deep Learning-Based Image Captioning Using CNN-LSTM Architecture for Automated Visual Understanding

SENAPATHI SATYA BHARGAVI, K. Venkatesh

doi:10.62643/

Keywords:

Image Captioning, Deep Learning, CNN, LSTM, Computer Vision, Natural Language Processing, Feature Extraction, Sequence Modeling, Flickr Dataset, VizWiz Dataset

Abstract

In the rapidly evolving field of artificial intelligence, enabling machines to interpret visual data and generate meaningful textual descriptions remains a challenging yet impactful problem. This project presents a deep learning-based approach to automated image captioning using a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture. The system is designed to bridge the gap between computer vision and natural language processing by transforming visual inputs into coherent, human-readable sentences.

The proposed model uses a CNN as a feature extractor to capture spatial information from images. These features are encoded into a compact representation and passed to an LSTM network, which generates the textual description sequentially. Unlike traditional methods that rely on handcrafted features, this approach uses deep learning to learn hierarchical representations automatically, thereby improving accuracy and contextual understanding.

The system is implemented with a graphical user interface (GUI) developed in Tkinter, which allows users to upload datasets, preprocess images and captions, train models, and generate captions for unseen images. The project supports multiple datasets, including Flickr and VizWiz, which enhances its robustness and adaptability. The preprocessing stage involves tokenization, padding, and vocabulary construction to convert textual data into a numerical form suitable for model training.

Performance is evaluated using standard metrics: accuracy, precision, recall, and F1-score. Experimental results indicate that the CNN-LSTM model generates meaningful captions and generalizes well across the different datasets.

Additionally, the system integrates text-to-speech functionality that converts generated captions into audio, enhancing accessibility for visually impaired users. Visualization tools, such as training accuracy and loss graphs, provide insight into model performance and convergence behavior. The project also compares results across multiple datasets, highlighting the effectiveness of the proposed approach in diverse scenarios.

Overall, this work contributes to the advancement of intelligent systems capable of understanding and describing visual content. The proposed solution has potential applications in assistive technologies, content-based image retrieval, automated image annotation, and smart surveillance systems. By combining the strengths of CNN and LSTM architectures, the system demonstrates a scalable and efficient framework for real-world image captioning tasks.
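
The caption preprocessing named in the abstract (tokenization, vocabulary construction, and padding) can be illustrated with a minimal sketch. It is not taken from the paper's implementation: the function names, the special tokens (`<pad>`, `<start>`, `<end>`, `<unk>`), the regular-expression tokenizer, and the sample captions are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical special tokens; the paper does not specify which ones it uses.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def tokenize(caption):
    """Lowercase a caption and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", caption.lower())


def build_vocab(captions, min_count=1):
    """Map each word seen at least min_count times to an integer id (0 is padding)."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    vocab = {PAD: 0, START: 1, END: 2, UNK: 3}
    for word in sorted(counts):  # sorted so that ids are deterministic
        if counts[word] >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab, max_len):
    """Turn a caption into exactly max_len ids: start + words + end, right-padded."""
    if max_len < 2:
        raise ValueError("max_len must leave room for the start and end tokens")
    tokens = tokenize(caption)[: max_len - 2]  # truncate long captions
    ids = [vocab[START]] + [vocab.get(t, vocab[UNK]) for t in tokens] + [vocab[END]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Made-up sample captions, not drawn from Flickr or VizWiz.
captions = ["A dog runs on the grass.", "A child plays with a dog."]
vocab = build_vocab(captions)
encoded = [encode(c, vocab, max_len=10) for c in captions]
```

Each caption becomes a fixed-length integer sequence, which is the kind of numerical form an LSTM decoder typically consumes through an embedding layer. Words absent from the vocabulary map to the `<unk>` id rather than raising an error.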