REASONING-DRIVEN HUMAN–OBJECT INTERACTION DETECTION USING YOLO-BASED LOCALIZATION AND VISION–LANGUAGE INTELLIGENCE
DOI:
https://doi.org/10.62643/ijerst.2026.v22.n1(2).pp233-237

Keywords:
Human–Object Interaction; YOLOv8; Spatial Reasoning; Vision–Language Model; BLIP-2; Object Detection; Computer Vision

Abstract
Human–Object Interaction (HOI) detection is a fundamental task in computer vision that aims to identify meaningful relationships between people and the objects around them. This paper presents a reasoning-driven HOI detection system that integrates YOLOv8n-based object localization with a multi-tier spatial reasoning engine and a Vision–Language Model (VLM) backend. The proposed pipeline detects persons and objects, extracts candidate pairs using Euclidean distance and Intersection-over-Union (IoU) metrics, and applies rule-based and semantically grounded reasoning to classify interactions. On a diverse evaluation set spanning sports, office, and daily-living scenes, the system achieves 87.3% precision and 82.6% recall at a 120-pixel distance threshold, outperforming baseline IoU-only methods by 14.2 percentage points in F1-score. The modular design supports plug-in VLM reasoning (BLIP-2) and is deployed as a Flask web application, enabling real-time visual explanations.
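The candidate-pair extraction step described above can be sketched as follows. This is a minimal illustration of distance- and IoU-based pairing, not the paper's actual implementation; the function names and the default 120-pixel threshold (taken from the reported evaluation setting) are assumptions.

```python
import math

def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center_distance(box_a, box_b):
    """Euclidean distance between the centers of two boxes."""
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(cax - cbx, cay - cby)

def candidate_pairs(persons, objects, dist_thresh=120.0, iou_thresh=0.0):
    """Keep (person, object) index pairs that are nearby or overlapping."""
    pairs = []
    for i, p in enumerate(persons):
        for j, o in enumerate(objects):
            if center_distance(p, o) <= dist_thresh or iou(p, o) > iou_thresh:
                pairs.append((i, j))
    return pairs
```

Pairs that survive this geometric filter would then be passed to the rule-based and VLM reasoning stages for interaction classification.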
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.