Proceedings of Technological Advances in Science, Medicine and Engineering Conference 2021

Suitable Approach for Building End-to-End Tamil language Chatbot for Closed Domain
Kumaran Kugathasan, Uthayasanker Thayasivam

Providing a great customer experience is key to growing any business, which is why many businesses employ customer service agents to handle client queries. Increasingly, however, businesses are deploying chatbots in place of expensive customer service agents. Building a chatbot is comparatively easy only for businesses whose customers speak a high-resource language such as English.

Languages that have many lexical, syntactic, semantic, and task-specific resources, along with large corpora across various domains, are classified as high-resource languages. Dictionaries, dependency tree corpora, semantic databases, part-of-speech (POS) taggers, and named-entity recognizers (NER) are examples of such resources. Low-resource languages are those that lack the corpora and other resources that are abundant in high-resource languages. Tamil belongs to the low-resource group.

Because of this abundance of resources, several paid and open-source chatbot frameworks are available for high-resource languages, and a great deal of new research is being conducted to improve chatbot technology for them. For Tamil, there is no such framework support: none of the available chatbot frameworks, such as Rasa, Dialogflow, Microsoft Bot Framework, and Facebook Bot Engine, supports Tamil. The approaches proposed in research for building high-resource language chatbots are not suitable for Tamil, because Tamil lacks the corpora and other NLP tools required to build an effective chatbot. In particular, Tamil has no standardized POS tagger, morphological analyser, or domain-specific NER. High inflexion and free word order pose further key challenges. Currently available Tamil chatbots largely suffer from these problems, even in a closed domain. Hence, developing an effective end-to-end chat system for Tamil remains a challenge.

In this paper, we propose the most suitable strategy for generating a dataset and the most effective methodology for building a Tamil chatbot for a closed domain. Our research found that the ideal data generation approach is to scrape data from the existing FAQ sections of Tamil websites related to the domain and then have a small group of native speakers expand the dataset so that it captures the morphological richness of the language. We then explored the existing options for building a Tamil chatbot and concluded that a machine learning-based retrieval approach is more suitable than traditional approaches such as parsing, AIML, and pattern matching. Our findings are not only useful for developing an effective Tamil chatbot but can also be adapted for similar low-resource languages.
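To make the retrieval idea concrete, the following is a minimal sketch of an FAQ retriever, not the system described in this abstract. It assumes TF-IDF weighting over character trigrams with cosine similarity, a choice made here only because character n-grams need no POS tagger or morphological analyser; the abstract does not say which features or model the authors used. The class name `FaqRetriever`, the `threshold` parameter, and the English stand-in FAQ pairs are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {' '.join(text.lower().split())} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class FaqRetriever:
    """Hypothetical retrieval model: returns the answer of the closest FAQ question."""

    def __init__(self, faq_pairs, n=3):
        self.n = n
        self.answers = [answer for _, answer in faq_pairs]
        docs = [Counter(char_ngrams(question, n)) for question, _ in faq_pairs]

        # Document frequency: in how many questions each n-gram occurs.
        df = Counter()
        for doc in docs:
            df.update(doc.keys())

        # Smoothed inverse document frequency.
        total = len(docs)
        self.idf = {g: math.log((1 + total) / (1 + c)) + 1.0 for g, c in df.items()}
        self.vectors = [self._weigh(doc) for doc in docs]

    def _weigh(self, counts):
        """Build an L2-normalised TF-IDF vector; unseen n-grams are dropped."""
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {g: w / norm for g, w in vec.items()} if norm else {}

    def answer(self, query, threshold=0.2):
        """Return the best-matching answer, or None if nothing is similar enough."""
        if not self.vectors:
            return None
        q = self._weigh(Counter(char_ngrams(query, self.n)))
        # Both vectors are normalised, so the dot product is the cosine similarity.
        scores = [sum(w * vec.get(g, 0.0) for g, w in q.items()) for vec in self.vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.answers[best] if scores[best] >= threshold else None


# Hypothetical FAQ pairs standing in for scraped, speaker-expanded Tamil data.
faq = [
    ("How do I open a savings account?", "ANSWER_OPEN"),
    ("What are the bank opening hours?", "ANSWER_HOURS"),
    ("How can I reset my online banking password?", "ANSWER_RESET"),
]
bot = FaqRetriever(faq)
print(bot.answer("opening a savings account"))
```

Because Python strings are Unicode, Tamil text can be passed in unchanged. One caveat is that a Tamil letter is often encoded as several code points (a base consonant plus a vowel sign), so n-grams taken over code points can split a written character; a real system might segment by grapheme cluster instead.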

Keywords: Tamil, Chatbot, Low-resource

Last modified: 2021-06-24
Building: TASME Center
Room: Science Hall
Date: July 3, 2021 - 10:05 AM – 10:20 AM
