DRIVEN VIRTUAL ASSISTANT FOR HEALTHCARE
DOI: https://doi.org/10.62643/
Abstract
The Gen AI Image Captioning and Detailing project applies Generative Artificial Intelligence, combining computer vision and natural language processing, to automatically generate accurate, context-aware textual descriptions of images. The system analyzes visual content and produces detailed captions that describe objects, actions, attributes, relationships, and overall scene context in natural language. This technology is relevant to accessibility support, smart surveillance, digital media management, content recommendation, and automated image understanding. The project implements a complete Vision-Language Model (VLM) pipeline using the BLIP-2 (Bootstrapping Language-Image Pre-training 2) framework with the OPT-2.7B language model as the decoder. The system is trained and evaluated on the COCO Captions 2017 dataset, which contains more than 123,000 images, each with multiple reference captions. The dataset covers diverse categories such as indoor scenes, outdoor environments, sports activities, food items, wildlife, and human interactions, enabling the model to learn rich visual-semantic relationships. The implemented model achieved a CIDEr score of 145.8 and a BLEU-4 score of 38.6, and generated detailed scene descriptions with high contextual accuracy. Its captions include object identification, activity recognition, spatial relationships, and attribute descriptions. Inference takes approximately 1.4 seconds per image on average on an NVIDIA A100 GPU.
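To make the reported BLEU-4 figure more concrete, the following is a minimal sketch of a sentence-level BLEU-4 computation: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and a brevity penalty. It is an illustration under stated assumptions, not the authors' evaluation code. The function names `ngrams` and `bleu4` are hypothetical, tokenization is a plain lowercase whitespace split, and no smoothing is applied. Captioning benchmarks usually compute BLEU at corpus level with a dedicated tokenizer, and a score such as 38.6 is presumably reported on a 0 to 100 scale, whereas this sketch returns a value between 0 and 1.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Simplified sentence-level BLEU-4 in [0, 1] (hypothetical helper, no smoothing)."""
    # Assumption: lowercase whitespace tokenization, not a benchmark tokenizer.
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens

        # For each n-gram, the highest count seen in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)

        # Clip candidate counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))

    # Geometric mean of the four precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    refs = ["a man is riding a horse on the beach"]
    print(bleu4("a man rides a horse on the beach", refs))
```

A candidate that matches a reference exactly scores 1.0, a candidate that shares no words with any reference scores 0.0, and partial matches fall in between. CIDEr, the other reported metric, additionally weights n-grams by TF-IDF across the reference set and is not covered by this sketch.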
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.













