GENAI VOICE TO TEXT TRANSFORMER
DOI: https://doi.org/10.62643/
Abstract
The GENAI Voice to Text Transformer project is an Automatic Speech Recognition (ASR) system that converts spoken audio into text using Generative Artificial Intelligence and Transformer-based deep learning architectures. Traditional speech-to-text systems are designed mainly for conversational speech and often struggle with complex audio such as music lyrics, varying speech tempos, background noise, accents, and instrumental interruptions. This project addresses these challenges by combining Generative AI with customized temporal processing and audio segmentation to provide accurate, context-aware transcription of both speech and lyrical content. At the core of the system is Faster-Whisper, an optimized implementation of OpenAI’s Whisper model that runs on the CTranslate2 inference engine. The model uses a Transformer-based sequence-to-sequence framework that captures contextual relationships between audio segments and predicts meaningful text rather than simply matching phonemes. This allows the system to maintain high transcription accuracy in noisy environments and across diverse accents and speaking styles. The optimized Faster-Whisper implementation also supports efficient real-time processing on standard consumer hardware, without requiring expensive cloud-based GPU infrastructure. A major innovation of the project is a specialized Lyrics Mode built on Temporal Gap Analysis and Voice Activity Detection (VAD). Unlike standard transcription systems, which generate continuous blocks of text, the proposed system analyzes the duration of silences in the audio stream to structure lyrical transcriptions naturally. Short pauses create line breaks that represent song bars, while longer pauses create stanza or verse separations, producing organized lyric formatting suited to musical content and creative applications.
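The Temporal Gap Analysis described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the function name `format_lyrics`, the `(start, end, text)` segment layout, and the thresholds `line_gap` and `stanza_gap` are assumptions chosen for illustration, since the abstract does not state the actual silence durations used.

```python
from typing import Iterable, List, Tuple

# Hypothetical segment layout: (start_seconds, end_seconds, text).
Segment = Tuple[float, float, str]


def format_lyrics(
    segments: Iterable[Segment],
    line_gap: float = 0.6,    # assumed threshold for a line break (seconds)
    stanza_gap: float = 2.0,  # assumed threshold for a stanza break (seconds)
) -> str:
    """Group timed text segments into lines and stanzas by silence length."""
    lines: List[str] = []
    current: List[str] = []
    prev_end = None

    for start, end, text in segments:
        text = text.strip()
        if not text:
            continue  # skip empty segments; they carry no lyric content
        if prev_end is not None:
            gap = start - prev_end  # silence between consecutive segments
            if gap >= stanza_gap:
                # Long pause: close the current line and add a blank line.
                lines.append(" ".join(current))
                lines.append("")
                current = []
            elif gap >= line_gap:
                # Short pause: close the current line only.
                lines.append(" ".join(current))
                current = []
        current.append(text)
        prev_end = end

    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        (0.0, 1.0, "Hello"),
        (1.1, 2.0, "world"),
        (2.8, 3.5, "second line"),
        (6.0, 7.0, "new verse"),
    ]
    print(format_lyrics(demo))
```

In practice the tuples could be built from the segments yielded by faster-whisper's `WhisperModel.transcribe`, which expose `start`, `end`, and `text` attributes; the thresholds would likely need tuning per genre and tempo.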
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.













