Regularizing Neural Machine Translation By Target-Bidirectional Agreement

Figure 1: Illustration of the common structure of NMT models in the two directions: the L2R model $P(\mathbf{y} \mid \mathbf{x}; \overrightarrow{\theta})$ and the R2L model $P(\mathbf{y} \mid \mathbf{x}; \overleftarrow{\theta})$.

Symmetrically, the regularized objective of the R2L model can be defined as:

$$\mathcal{L}(\overleftarrow{\theta}) = \sum_{n=1}^{N} \Big[ \log P(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \overleftarrow{\theta}) - \lambda\, \mathrm{KL}\big( P(\mathbf{y} \mid \mathbf{x}^{(n)}; \overrightarrow{\theta}) \,\big\|\, P(\mathbf{y} \mid \mathbf{x}^{(n)}; \overleftarrow{\theta}) \big) \Big] \qquad (9)$$

and the corresponding training procedure is similar to Algorithm 1.

Based on the L2R and R2L models described above, we can make them act as auxiliary systems for each other: the L2R model $P(\mathbf{y} \mid \mathbf{x}; \overrightarrow{\theta})$ is used as an auxiliary system to regularize the R2L model $P(\mathbf{y} \mid \mathbf{x}; \overleftarrow{\theta})$, and the R2L model is in turn used as an auxiliary system to regularize the L2R model. This training process can be carried out iteratively for further improvement, because after each iteration both the L2R and the R2L model are expected to have improved through this regularization.

4. Train the two models with an iterative process. In each iteration, we fix the R2L model and use it as a helper to optimize the L2R model with Equation 3; at the same time, we fix the L2R model and use it as a helper to optimize the R2L model with Equation 9. Iterative training continues until performance on the development set no longer improves.
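The iterative procedure in step 4 can be summarized in code. The sketch below is a simplified PyTorch-style illustration, not the paper's implementation: it assumes each model maps (source, target) batches to per-token logits, approximates the sequence-level KL term of Equations 3 and 9 by a token-level KL on the reference target, and leaves out details such as how targets are reversed for the R2L direction; names like `regularized_loss`, `train_batches`, and `dev_bleu` are illustrative.

```python
import torch
import torch.nn.functional as F

def regularized_loss(model, helper, src, tgt, lam):
    """Negative of the per-sentence objective in Eq. (9): NLL of `model`
    plus lam * KL(helper || model). Token-level approximation of the
    sequence-level KL; `helper` is kept fixed (no gradient)."""
    logits = model(src, tgt)                        # (batch, len, vocab)
    nll = F.cross_entropy(logits.transpose(1, 2), tgt)
    with torch.no_grad():
        helper_logp = F.log_softmax(helper(src, tgt), dim=-1)
    model_logp = F.log_softmax(logits, dim=-1)
    kl = F.kl_div(model_logp, helper_logp, log_target=True,
                  reduction="batchmean")
    return nll + lam * kl

def joint_training(l2r, r2l, train_batches, dev_bleu, lam, opt_l2r, opt_r2l):
    """Step 4: alternately regularize each direction with the other,
    stopping when BLEU on the development set no longer improves."""
    best = dev_bleu(l2r)
    while True:
        # Fix R2L, optimize L2R (Eq. 3).
        for src, tgt in train_batches():
            opt_l2r.zero_grad()
            regularized_loss(l2r, r2l, src, tgt, lam).backward()
            opt_l2r.step()
        # Fix L2R, optimize R2L (Eq. 9); reversing targets for the
        # R2L direction is elided in this sketch.
        for src, tgt in train_batches():
            opt_r2l.zero_grad()
            regularized_loss(r2l, l2r, src, tgt, lam).backward()
            opt_r2l.step()
        score = dev_bleu(l2r)
        if score <= best:
            break
        best = score
```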

Experiment Configuration

To study the effectiveness of our proposed approach, we conduct experiments on three datasets: NIST OpenMT for Chinese-English, and WMT17 for English-German and Chinese-English. In all experiments, we use BLEU (Papineni et al. 2002) as the automatic metric for translation evaluation.

Datasets. For the NIST OpenMT Chinese-English translation task, we select our training data from the LDC corpora, which consist of 2.6M sentence pairs with 65.1M Chinese words and 67.1M English words. Sentence pairs longer than 80 words are removed from the training data. The NIST OpenMT 2006 evaluation set is used as the validation set, and the NIST 2003, 2005, 2008, and 2012 datasets are used as test sets. We limit the vocabulary to the 50K most frequent words on both the source and target sides and convert the remaining words to UNK tokens; at decoding time, we follow Luong et al. (2015) to handle UNK replacement.
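As a concrete illustration of the corpus preparation just described (the 80-word length cut-off and the 50K-word vocabularies with UNK replacement), here is a minimal sketch. It is not the paper's pipeline; the function names, the `<unk>` symbol spelling, and whitespace tokenization are assumptions.

```python
from collections import Counter

UNK = "<unk>"  # assumed symbol for out-of-vocabulary words

def build_vocab(sentences, size=50000):
    """Keep the `size` most frequent tokens; all other words map to UNK."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(size)}

def filter_and_unk(pairs, src_vocab, tgt_vocab, max_len=80):
    """Drop sentence pairs longer than max_len words and replace rare words."""
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if len(s) > max_len or len(t) > max_len:
            continue
        yield (" ".join(w if w in src_vocab else UNK for w in s),
               " ".join(w if w in tgt_vocab else UNK for w in t))
```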

For the English-German translation task of WMT17, we use the pre-processed training data provided by the organizers, which consists of 5.8M sentence pairs with 141M English words and 134M German words. We use newstest2016 as the validation set and newstest2017 as the test set. The maximum sentence length is set to 128. For the vocabulary, we use 37K sub-word units based on Byte Pair Encoding (BPE) (Sennrich, Haddow, and Birch 2016b).

For the Chinese-English translation task of WMT17, we use all available parallel data, which consist of 24M sentence pairs, including the News Commentary, UN Parallel Corpus, and CWMT Corpus. newsdev2017 is used as the validation set and newstest2017 as the test set. We also limit the maximum sentence length to 128. For pre-processing, we segment Chinese sentences with our in-house Chinese word segmentation tool and tokenize English sentences with the scripts in Moses. We then learn a BPE model with 32K merge operations on the pre-processed sentences, obtaining vocabularies of 44K and 33K tokens for the source and target sides, respectively.
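Since the WMT17 setups rely on a learned BPE model, the sketch below shows the merge-learning step of the BPE algorithm as published by Sennrich, Haddow, and Birch (2016b); in practice a standard toolkit would be used, and the names `learn_bpe_merges` and `word_freqs` are illustrative.

```python
import re
from collections import defaultdict

def pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe_merges(word_freqs, num_merges=32000):
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # Words are split into characters, with an end-of-word marker.
    vocab = {" ".join(list(w)) + " </w>": f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

Setting `num_merges=32000` corresponds to the 32K merge operations mentioned above for the Chinese-English task.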