
Developing a Hybrid Morphological Analyzer for Low-Resource Languages

Research output: Contribution to journal › Article › peer-review

Abstract

Morphological analysis is a fundamental, preliminary task for Natural Language Processing (NLP) applications involving speech and language. Kannada, a low-resource language of the Dravidian family, is highly agglutinative and morphologically rich, and datasets for it are being developed rapidly in response to the growing demand for NLP tools. This study presents a hybrid approach that integrates rule-based and Transformer-based techniques, aiming to combine their strengths while limiting their respective weaknesses. Because the morphological richness of Kannada makes the analysis of inflections challenging, 85 paradigms are created using Apertium's Lttoolbox. A Transformer model is then trained on the generated nominal data to produce morphological analyses for out-of-vocabulary inflections. The hybrid approach can be easily extended to new words as they are added to the dictionary. The results obtained on a test set of Kannada inflections are a precision of 0.924, a recall of 0.925, and an F1 score of 0.925. The main contributions are rule extraction for paradigm design at the word level; morphological analysis of nouns, verbs, adjectives, pronouns, and indeclinables on a benchmark dataset; and generation of morphological analyses using the Transformer architecture.
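The control flow the abstract describes (try the paradigm dictionary first, fall back to a learned model for out-of-vocabulary forms) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the lexicon entry, the transliterated suffixes, the tag names, and the functions `rule_based_analyze`, `neural_fallback`, and `hybrid_analyze` are all hypothetical, and the fallback is a stub standing in for a trained Transformer.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy paradigm: suffix -> Apertium-style tag string.
# Transliterated and simplified for illustration; not taken from the paper.
NOUN_PARADIGM: Dict[str, str] = {
    "": "<n><sg>",
    "galu": "<n><pl>",
    "yalli": "<n><sg><loc>",
}

# Hypothetical lexicon: stem -> paradigm it inflects by.
LEXICON: Dict[str, Dict[str, str]] = {"mane": NOUN_PARADIGM}


def rule_based_analyze(word: str) -> List[str]:
    """Return every analysis licensed by a known stem plus paradigm suffix."""
    analyses = []
    for stem, paradigm in LEXICON.items():
        if word.startswith(stem):
            suffix = word[len(stem):]
            if suffix in paradigm:
                analyses.append(stem + paradigm[suffix])
    return analyses


def neural_fallback(word: str) -> List[str]:
    """Stub for a trained sequence-to-sequence model (assumption)."""
    return [word + "<unk>"]


def hybrid_analyze(
    word: str, fallback: Callable[[str], List[str]] = neural_fallback
) -> Tuple[List[str], str]:
    """Use the dictionary when it covers the word, otherwise the model."""
    analyses = rule_based_analyze(word)
    if analyses:
        return analyses, "rule"
    return fallback(word), "model"


print(hybrid_analyze("manegalu"))
print(hybrid_analyze("pustakagalu"))
```

In this sketch, adding a new stem to `LEXICON` moves its inflections from the fallback path to the rule-based path, which mirrors the extensibility the abstract mentions.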

Original language: English
Article number: 5682
Journal: Applied Sciences (Switzerland)
Volume: 15
Issue number: 10
Publication status: Published - 05-2025

All Science Journal Classification (ASJC) codes

  • General Materials Science
  • Instrumentation
  • General Engineering
  • Process Chemistry and Technology
  • Computer Science Applications
  • Fluid Flow and Transfer Processes
