Object Detection-Driven Image Captioning: Integrating YOLO with Natural Language Processing
DOI: https://doi.org/10.62643/
Abstract
Image captioning is a significant interdisciplinary research domain that bridges computer vision and natural language processing (NLP) to automatically generate descriptive textual representations of visual content. This paper presents an object detection-driven image captioning framework that integrates the You Only Look Once (YOLO) algorithm with deep learning-based natural language models to produce syntactically and semantically accurate captions. By leveraging YOLO's real-time object detection capabilities, the proposed system identifies key visual components within an image, which are subsequently processed by a Long Short-Term Memory (LSTM)-based language model augmented with an attention mechanism. The model is trained and evaluated on the MS COCO dataset, demonstrating competitive performance against existing CNN-RNN baselines in terms of BLEU, METEOR, and CIDEr scores. This approach enhances contextual relevance, improves object-level precision in captions, and holds practical applicability in assistive technologies for the visually impaired and autonomous driving systems.
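The attention step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the form of attention, so plain dot-product soft attention is assumed here, and the detections, feature vectors, and decoder state are hypothetical toy values standing in for YOLO outputs and an LSTM hidden state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden, object_features):
    """Dot-product soft attention (an assumed form, not taken from the paper).

    hidden: decoder state vector at the current word step.
    object_features: one feature vector per detected object.
    Returns the attention weights and the weighted context vector.
    """
    scores = [sum(h * f for h, f in zip(hidden, feat)) for feat in object_features]
    weights = softmax(scores)
    dim = len(hidden)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, object_features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical detector output: (label, feature vector) pairs.
detections = [
    ("person", [1.0, 0.0, 0.0]),
    ("dog", [0.0, 1.0, 0.0]),
    ("frisbee", [0.0, 0.0, 1.0]),
]
# Hypothetical decoder hidden state at one decoding step.
hidden = [0.2, 2.0, 0.1]

labels = [label for label, _ in detections]
features = [feat for _, feat in detections]
weights, context = attend(hidden, features)

# The object with the largest weight is the one the decoder focuses on.
focus = labels[weights.index(max(weights))]
print(focus, [round(w, 3) for w in weights])
```

In a full system the context vector would be fed to the language model together with the previous word embedding at every decoding step, so that each generated word can draw on a different detected object.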
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.