PhD Thesis
Announcements of the last step towards the PhD
Exploiting lexical information and discriminative alignment training in statistical machine translation
Advisor: Dr. Rafael E. Banchs Martínez
Tutor: Dra. Núria Castell Ariño
Summary: The thesis work focused mainly on three aspects of statistical machine translation: the use of lexical information such as basic lexical models and multi-word expressions, minimum error training strategies, and word alignment models. These aspects were addressed within the n-gram-based machine translation framework. In this approach, the joint translation probability is modelled as a log-linear combination of a bilingual n-gram model and additional feature functions.
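As a reading aid, the log-linear combination mentioned above is commonly written as below. The notation is an assumption chosen for illustration and is not taken from the thesis: s is the source sentence, t a candidate translation, h_m the feature functions with weights \lambda_m, and (s,t)_k the k-th bilingual unit of the sentence pair, modelled with an n-gram of order N.

```latex
\hat{t} = \arg\max_{t} \sum_{m=1}^{M} \lambda_m \, h_m(s, t),
\qquad
h_1(s, t) = \log \prod_{k=1}^{K}
  p\bigl( (s,t)_k \mid (s,t)_{k-N+1}, \dots, (s,t)_{k-1} \bigr)
```

Here h_1 stands for the bilingual n-gram model and h_2, ..., h_M for the additional feature functions. The weights \lambda_m are the parameters that minimum error training adjusts.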
First, a thorough study of word alignment evaluation is carried out. We stress the impact on the scores of the way alignment test data are built, and we give guidelines for manual alignment. The n-gram-based machine translation system is then described. After this, we evaluate the impact on alignment quality of linguistic classifications such as lemmatising, stemming or verb classification. Although these transformations have a large positive impact on word alignment, we report that this improvement has no effect on translation quality. We also examine the impact on word alignment quality and translation accuracy of grouping data-inferred multi-word expressions before alignment.
Another objective of this thesis was the improvement of minimum error training strategies. Two research lines were considered: the choice of the metric used as objective function, and the improvement of the optimisation algorithm itself. In the first research line, parameters were successfully tuned with respect to the Queen score of the Qarla framework, which combines different metrics with a stable and robust criterion. In the second line, the Simultaneous Perturbation Stochastic Approximation algorithm and the downhill simplex method were compared for this parameter optimisation task.
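For readers unfamiliar with Simultaneous Perturbation Stochastic Approximation (SPSA), the following is a minimal sketch of its standard update rule, not the implementation used in the thesis. The function names, the gain constants, and the quadratic `toy_loss` (a stand-in for a translation error score computed on a development set) are all assumptions made for illustration.

```python
import random


def spsa_minimize(loss, theta, iterations=200, a=0.2, c=0.1, A=10.0,
                  alpha=0.602, gamma=0.101, seed=0):
    """Minimise `loss` with the standard SPSA update (illustrative sketch)."""
    rng = random.Random(seed)
    theta = list(theta)
    for k in range(iterations):
        a_k = a / (k + 1 + A) ** alpha   # decaying step size
        c_k = c / (k + 1) ** gamma       # decaying perturbation size
        # Perturb all weights at once with a random +/-1 vector.
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]
        plus = [t + c_k * d for t, d in zip(theta, delta)]
        minus = [t - c_k * d for t, d in zip(theta, delta)]
        # Two loss evaluations give a gradient estimate for every weight.
        diff = loss(plus) - loss(minus)
        theta = [t - a_k * diff / (2.0 * c_k * d)
                 for t, d in zip(theta, delta)]
    return theta


def toy_loss(weights):
    """Hypothetical objective; a real system would decode and score here."""
    target = (0.5, -1.0, 2.0)
    return sum((w - t) ** 2 for w, t in zip(weights, target))


tuned = spsa_minimize(toy_loss, [0.0, 0.0, 0.0])
print(toy_loss([0.0, 0.0, 0.0]), toy_loss(tuned))
```

Each SPSA iteration needs only two evaluations of the objective, whatever the number of weights. This matters when every evaluation requires translating a development set.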
Finally, we propose a novel framework for discriminative training of alignment models with automated translation metrics as the maximisation criterion. To evaluate this framework, we implemented an alignment system based on discriminative models adapted to the n-gram-based translation system, and we observed a clear improvement of automated translation scores on small corpora. We extended this framework to large corpora by tuning the alignment system parameters on a small part of the corpus and then using them to align the whole corpus. The resulting parameters produced machine translation systems at least as good as those obtained with standard word alignment tools, but in a more flexible way and with lower computational resource requirements.
Date: 25th of April
Time: 11h
Place: Aula Teleensenyament de l'edifici B3
Campus Nord.
Press Contact
ilapuente@lsi.upc.edu
