Cross-Lingual Semantic Embeddings for Optimizing Real-Time Multilingual Language Identification

Authors

  • S. Laxmi Lalitha
  • Arelli Anjali
  • Bhukya Pravalika
  • Geetla Siddharth Reddy
  • Chilupuri Sanjith

DOI:

https://doi.org/10.62643/ijerst.2026.v22.n1(2).pp1-13

Keywords:

Multilingual communication, Language identification, Natural Language Processing, Data analytics, Transformer embeddings, Machine learning.

Abstract

More than 7,000 languages are spoken worldwide, and as global connectivity grows, nearly 60% of internet users engage in multilingual communication daily. Recent reports indicate that 40% of multilingual content remains misclassified or under-processed because accurate language identification tools are lacking. Manual approaches to identifying the language of a text are tedious and error-prone, and often fail on short or informal content. Conventional algorithms, in turn, lack the semantic depth to capture subtle linguistic variation in multilingual datasets. This work proposes a transformer-based multilingual language identification system built on robust natural language representations. The methodology starts from a multilingual dataset, which undergoes Natural Language Processing (NLP) preprocessing (tokenization, stop-word removal, and lemmatization), followed by Exploratory Data Analysis (EDA) to understand distribution trends. Semantic features are extracted with Miniature Language Model (MiniLM), a transformer-based embedding framework optimized for speed and accuracy. For baseline comparison, three traditional classifiers are tested: Decision Tree Classifier (DTC), K-Nearest Neighbors (KNN), and Gaussian Naïve Bayes Classifier (GNB). The proposed model employs a Random Forest Classifier (RFC), chosen for its robustness on high-dimensional features and its ensemble-based learning. Combining MiniLM embeddings with the RFC significantly improves multilingual text classification performance, enabling efficient recognition of diverse languages in short text inputs, code-mixed content, and informal phrases. The system is deployed as a Flask-based web application for real-time classification, with potential use in translation services, multilingual chatbots, and global communication platforms.
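
The embed-then-classify flow described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: hashed character trigrams stand in for MiniLM embeddings, and a cosine-similarity nearest-neighbour vote (in the spirit of the KNN baseline) stands in for the Random Forest Classifier. The names `embed`, `build_index`, `predict`, `DIM`, and `SAMPLES`, as well as the six example sentences, are assumptions introduced only for illustration.

```python
import math
import zlib
from collections import Counter

DIM = 512  # hashed feature size; an arbitrary illustrative choice


def embed(text):
    """Map text to a unit-length vector of hashed character trigrams.

    Stand-in for a MiniLM sentence embedding; not the paper's encoder.
    """
    padded = f" {text.lower()} "
    vec = [0.0] * DIM
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        # crc32 gives a hash that is stable across runs, unlike built-in hash().
        vec[zlib.crc32(trigram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def build_index(samples):
    """Embed every labelled training sentence once."""
    return [(embed(text), label) for text, label in samples]


def predict(text, index, k=3):
    """Majority vote over the k training vectors most similar to the query."""
    query = embed(text)
    nearest = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data; the paper's multilingual dataset is not reproduced here.
SAMPLES = [
    ("where is the nearest train station", "English"),
    ("thank you very much for your help", "English"),
    ("dónde está la estación de tren", "Spanish"),
    ("muchas gracias por tu ayuda", "Spanish"),
    ("wo ist der nächste bahnhof", "German"),
    ("vielen dank für ihre hilfe", "German"),
]

index = build_index(SAMPLES)
print(predict("gracias por la ayuda", index, k=1))
```

In a system like the one described, `embed` would presumably be replaced by a call to a MiniLM encoder and `predict` by a trained Random Forest, while the surrounding structure (embed once, classify on the vector) would stay the same.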

Published

20-03-2026

How to Cite

Cross-Lingual Semantic Embeddings for Optimizing Real-Time Multilingual Language Identification. (2026). International Journal of Engineering Research and Science & Technology, 22(1(2)), 1-13. https://doi.org/10.62643/ijerst.2026.v22.n1(2).pp1-13