Subwording Techniques for Neural Machine Translation between the English-Tamil Language Pair
Janarthanasarma Baskarakurukkal, Uthayasanker Thayasivam

In recent years, neural machine translation (NMT) has shown substantial improvement in machine translation tasks, outperforming its predecessor, statistical machine translation (SMT), for many language pairs. It has attracted considerable attention from both academia and industry and is regarded as a promising direction for future work in machine translation. Although it has produced human-like performance for many language pairs, NMT has yet to achieve the desired quality for translations involving low-resource and morphologically rich languages.

In English-Tamil machine translation, quality remains low because of two challenges: language dissimilarity and the lack of available parallel corpora. Tamil differs from English in its morphology (English: fusional, Tamil: agglutinative), word order (English: SVO, Tamil: SOV) and grammatical structures. Because Tamil is morphologically rich, the same noun or verb base word has many inflectional forms, so a large amount of parallel training data is required to cover the Tamil surface word forms. At the same time, Tamil is a low-resource language. NMT systems generally need large parallel corpora for training and perform significantly worse without them. Being both morphologically rich and low-resource makes English-Tamil NMT even more difficult.

NMT systems are generally trained on a limited, fixed-size vocabulary because training complexity increases with vocabulary size. Translation, however, is an open-vocabulary problem: models trained on a limited vocabulary cannot translate words outside it, known as out-of-vocabulary (OOV) words. Since Tamil is both highly inflected and low-resource, it suffers significantly from the OOV problem, so handling OOV words is crucial for improving translation quality. Subwording techniques were introduced to address this issue. In subwording, subwords (units that lie between a word and a character) are used as the base tokens. Sennrich et al. first introduced subwording based on Byte Pair Encoding (BPE). Subsequently, many other subwording techniques, such as WordPiece, SentencePiece and morphological segmentation, have been proposed.
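To make the idea of BPE-based subwording concrete, the following is a minimal sketch of learning and applying merge operations, in the spirit of the procedure described by Sennrich et al. It is not the implementation used in our experiments. The function names (`learn_bpe`, `segment`), the `</w>` end-of-word marker and the toy English word frequencies are assumptions made for illustration. The sketch also treats each Unicode code point as a starting symbol, which is a simplification for Tamil, where one written character can span several code points.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every whole-symbol occurrence of `pair` with one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} dictionary."""
    # Start from characters, with an end-of-word marker (illustrative choice).
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the merges."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols


# Toy corpus statistics (hypothetical, for illustration only).
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_freqs, num_merges=10)
# "lowest" never occurs in the toy corpus, yet it is still segmented
# into known subword units instead of becoming an OOV token.
print(segment("lowest", toy_merges))
```

Because any word can fall back to smaller units, ultimately single characters, a model trained on such segments has no out-of-vocabulary words in the usual sense. The number of merge operations is the main configuration choice, since it controls how close the resulting units are to whole words or to characters.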

For English-Tamil machine translation, subwording techniques such as suffix splitting and morphological segmentation have been used with significant success in statistical machine translation. With NMT, BPE and morphological segmentation have been used, but their contribution to the performance of MT systems has not been analysed. Further, there are no comparative studies of subwording techniques for English-Tamil neural machine translation. In our research, we explored and empirically compared different subwording techniques, namely BPE, SentencePiece and Orthographic Syllables, in different configurations. We found that subwording techniques improve translation quality and that, on average, the SentencePiece model works best for the English-Tamil language pair. Experiments also showed that the performance of the different techniques depends on the dataset and on the translation direction.

