mBERT based model for identification of offensive content in south Indian languages

  • Shankar Biradar*
  • Sunil Saumya
  • Arun Chauhan

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review

2 Citations (Scopus)

Abstract

Offensive content has drawn considerable attention in recent years. The volume of offensive content posted on social media is increasing at an alarming rate, making the need to address it greater than ever. To that end, the organizers of “Dravidian-Code Mixed HASOC-2021” created two tasks. Task 1 involves identifying offensive content in Malayalam data, whereas Task 2 covers Malayalam and Tamil code-mixed sentences. Our team participated in Task 2. In our proposed model, we used multilingual BERT (mBERT) to extract features and trained two different classifiers, a Support Vector Machine (SVM) and a Deep Neural Network (DNN), on the extracted features. In addition, we used the task data to evaluate the performance of a monolingual BERT classifier. Our best-performing model, monolingual BERT, achieved a weighted F1 score of 0.70 on the Malayalam data, ranking fifth, and a weighted F1 score of 0.573 on the Tamil code-mixed data, ranking twelfth.
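The results in the abstract are reported as weighted F1 scores. As a minimal sketch of how that metric is commonly computed (per-class F1 averaged with weights proportional to each class's support), the following pure-Python function may help; it is an illustration and not the shared task's official scorer. The function name `weighted_f1`, the label names `OFF`/`NOT`, and the toy predictions are all hypothetical and do not come from the paper or its data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0

    for label, n_true in support.items():
        # True positives: predicted `label` and it really is `label`.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        # False positives: predicted `label` but the true class differs.
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        # False negatives: true `label` examples that were missed.
        fn = n_true - tp

        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0

        # Each class contributes in proportion to its share of the data.
        score += (n_true / total) * f1

    return score


# Toy labels (hypothetical, not taken from the shared-task data).
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because the weights follow class support, a majority class influences the score more than a minority class, which matters when offensive and non-offensive examples are imbalanced.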

Original language: English
Pages (from-to): 680-687
Number of pages: 8
Journal: CEUR Workshop Proceedings
Volume: 3159
Publication status: Published - 2021
Event: Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021 - Gandhinagar, India
Duration: 13-12-2021 to 17-12-2021

All Science Journal Classification (ASJC) codes

  • General Computer Science
