Domain Adaptation in Statistical Machine Translation

Title: Domain Adaptation in Statistical Machine Translation
Publication Type: Master Theses
Year of Publication: 2007
Authors: Mavroeidis, D
Number of Pages: 52
University: The University of Edinburgh
Thesis Type: masters

Human beings can categorize a document by its topic, and computers already perform very well on that task. When translating from one language to another, however, a human translator also uses this knowledge to adapt writing style and vocabulary so that the translation sounds as natural as possible. Statistical Machine Translation (SMT) uses probabilistic machine learning methods to perform translations, but such systems do not perform well in domains different from the ones used to train them. How can the ability to recognize the topic of a document be captured by an SMT system so that it performs better? This thesis explores methodologies for adapting an SMT system to a specific domain. Two methods are examined. The first mixes translation models and language models, weighting them appropriately to improve translation quality. The second uses unsupervised methods to cluster a corpus into sub-corpora, trains a system on each sub-corpus individually, and decodes each new sentence with the system trained on the cluster that matches its genre or “domain”. Experiments showed improved translation quality with both methods. Training on a small domain-specific corpus together with a large general one can improve performance when translating documents in the small corpus's domain.
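
The idea of mixing a small in-domain model with a large general one can be illustrated with a toy sketch. The abstract does not say how the thesis combines or weights its models, so everything here is an assumption made for illustration: the models are add-alpha smoothed unigram language models, the combination is a linear interpolation, and the weight is picked by a grid search on held-out perplexity. The function names and the tiny corpora are hypothetical.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram model over a fixed vocabulary (toy stand-in for a real LM)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def interpolate(p_in, p_out, lam):
    """Linear mixture: lam * in-domain + (1 - lam) * general (assumed combination rule)."""
    return {w: lam * p_in[w] + (1.0 - lam) * p_out[w] for w in p_in}


def perplexity(model, tokens):
    """Per-token perplexity of a unigram model on a token list."""
    log_sum = sum(math.log(model[w]) for w in tokens)
    return math.exp(-log_sum / len(tokens))


def tune_weight(p_in, p_out, dev_tokens, steps=11):
    """Grid-search the mixture weight that minimises perplexity on held-out text."""
    grid = [i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda lam: perplexity(interpolate(p_in, p_out, lam), dev_tokens))


# Hypothetical toy data: a small in-domain corpus, a larger general one, and held-out text.
in_domain = "the patient received the dose".split()
general = "the market fell and the bank raised the rate again today".split()
dev = "the patient received the rate".split()
vocab = set(in_domain) | set(general) | set(dev)

p_in = train_unigram(in_domain, vocab)
p_out = train_unigram(general, vocab)
best_lam = tune_weight(p_in, p_out, dev)
mixed = interpolate(p_in, p_out, best_lam)
```

Because the grid includes the weights 0 and 1, the tuned mixture is never worse on the held-out text than either model alone. A real SMT system would apply the same kind of weighting to n-gram language models and phrase tables rather than to unigram distributions.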