Offensive Language Identification in Code-Mixed Social Media Posts/Comments

Proceedings of Technological Advances in Science, Medicine and Engineering Conference 2021

Charangan Vasantharajan, Uthayasanker Thayasivam

Abstract

Code-mixed content is a challenge for Natural Language Processing because its characteristics differ substantially from the traditional structures of the languages involved. This has drawn researchers' attention and led to work on tasks such as identifying different forms of such content (e.g., hate speech, emotions, and sentiments) and creating datasets. In this paper, we propose a novel approach called hypers that classifies code-mixed (Tamil-English) text from social media posts/comments into one of six classes (Not-offensive, Offensive Targeted Insult Other, Offensive Targeted Insult Individual, Offensive Targeted Insult Group, not Tamil, and Offensive Untargeted), using a bidirectional approach and fine-tuning strategies. We use the shared parameters of neural networks to map the posts/comments to a common offensive class space, and we apply preprocessing steps that convert emoji into text, remove emoticons, lowercase the text, and remove numbers and some specific characters. Our proposed model obtained a 0.73 F1 score and achieved state-of-the-art results on this task.
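
The preprocessing steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the emoticon pattern, the choice of ASCII punctuation as the "specific characters", and the use of Unicode character names as the emoji-to-text mapping are all assumptions made for illustration.

```python
import re
import string
import unicodedata

# Hypothetical emoticon pattern (assumption): eyes, optional nose, mouth,
# not attached to a word on either side. The paper does not list its emoticons.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-o*']?[)(\]\[dDpP/\\|](?!\w)")

# Assumption: "specific characters" is taken to mean ASCII punctuation only,
# which leaves Tamil-script letters and vowel signs untouched.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def emoji_to_text(text):
    """Replace each emoji with its lowercase Unicode name (an assumed mapping)."""
    out = []
    for ch in text:
        # Category "So" (Symbol, other) covers most pictographic emoji.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            out.append(" " + name.lower() + " " if name else " ")
        else:
            out.append(ch)
    return "".join(out)


def preprocess(text):
    text = emoji_to_text(text)          # emoji -> text
    text = EMOTICON_RE.sub(" ", text)   # remove emoticons such as :) or :D
    text = text.lower()                 # lowercase
    text = re.sub(r"\d+", " ", text)    # remove numbers
    text = PUNCT_RE.sub(" ", text)      # remove ASCII punctuation
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Semma Mass 😂 :) 2021!!"))
# → semma mass face with tears of joy
```

Emoticons are removed before lowercasing so that forms such as `:D` are still matched by the pattern; the order of the remaining steps is a design choice of this sketch rather than something the abstract specifies.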

Keywords: code-mixing, offensive, Tamil

Last modified: 2021-06-24

Building: TASME Center
Room: Science Hall
Date: July 3, 2021 - 09:35 AM – 09:50 AM
