Object Detection-Driven Image Captioning: Integrating YOLO with Natural Language Processing

Authors

  • Mr. D Koteswara Rao, V Krishna Pratap, CH Rambabu

DOI:

https://doi.org/10.62643/

Abstract

Image captioning is a significant interdisciplinary research domain that bridges computer vision and natural language processing (NLP) to automatically generate descriptive textual representations of visual content. This paper presents an object detection-driven image captioning framework that integrates the You Only Look Once (YOLO) algorithm with deep learning-based natural language models to produce syntactically and semantically accurate captions. By leveraging YOLO's real-time object detection capabilities, the proposed system identifies key visual components within an image, which are subsequently processed by a Long Short-Term Memory (LSTM)-based language model augmented with an attention mechanism. The model is trained and evaluated on the MS COCO dataset, demonstrating competitive performance against existing CNN-RNN baselines in terms of BLEU, METEOR, and CIDEr scores. This approach enhances contextual relevance, improves object-level precision in captions, and holds practical applicability in assistive technologies for the visually impaired and autonomous driving systems.
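
The attention step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each detected object has already been mapped to a fixed-length feature vector and that the decoder exposes a hidden state of the same length. The scoring function (plain dot-product), the function names `softmax` and `attend`, and the toy feature values are all illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(object_feats, hidden):
    """Dot-product attention over detected-object features.

    object_feats: list of feature vectors, one per detected object (assumed).
    hidden: current decoder hidden state, same length as each feature vector.
    Returns (context_vector, attention_weights).
    """
    # Score each object by its similarity to the decoder state.
    scores = [sum(f * h for f, h in zip(feat, hidden)) for feat in object_feats]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the object features.
    dim = len(hidden)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, object_feats))
        for i in range(dim)
    ]
    return context, weights


if __name__ == "__main__":
    # Hypothetical detections with toy 3-dimensional features.
    labels = ["person", "dog", "frisbee"]
    feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.3, 0.9]]
    decoder_state = [0.1, 1.0, 0.2]
    context, weights = attend(feats, decoder_state)
    for label, w in zip(labels, weights):
        print(f"{label}: {w:.3f}")
```

In a full captioning model the context vector would be fed to the LSTM at each decoding step, so the word being generated is conditioned on the objects most relevant at that moment.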

Published

31-05-2026

How to Cite

Object Detection-Driven Image Captioning: Integrating YOLO with Natural Language Processing. (2026). International Journal of Engineering Research and Science & Technology, 22(2), 3043-3047. https://doi.org/10.62643/