Step 1: Preprocess the data

th preprocess.lua -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

We will be working with some example data in the data/ folder.

The data consists of parallel source (src) and target (tgt) data containing one sentence per line with tokens separated by a space:

  • src-train.txt
  • tgt-train.txt
  • src-val.txt
  • tgt-val.txt

Validation files are required and are used to evaluate the convergence of training. A validation set usually contains no more than 5000 sentences.
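
Because the source and target files are aligned line by line, it can help to sanity-check a pair before preprocessing. The following is a minimal sketch, not part of OpenNMT: `check_parallel` is a hypothetical helper, the sample sentences are made up, and flagging blank lines as a problem is an assumption rather than a documented requirement.

```python
from itertools import zip_longest


def check_parallel(src_lines, tgt_lines):
    """Return a list of (line_number, problem) for a src/tgt pair of line iterables."""
    problems = []
    for i, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            # One file ran out of lines before the other: the pair is misaligned.
            problems.append((i, "line count mismatch"))
            break
        if not src.split() or not tgt.split():
            # Assumption: a sentence with no tokens is worth flagging.
            problems.append((i, "empty sentence"))
    return problems


# Toy in-memory data (made up); tokens are separated by a single space.
src = ["It is not acceptable .", "Hello world"]
tgt = ["Es ist nicht akzeptabel .", "Hallo Welt"]
print(check_parallel(src, tgt))
```

To check real files, pass two open file objects instead of lists, for example the handles for data/src-train.txt and data/tgt-train.txt opened with encoding="utf-8".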

$ head -n 3 data/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
" Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .

After running the preprocessing, the following files are generated:

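  • demo.src.dict: Dictionary mapping source vocabulary to indices.
  • demo.tgt.dict: Dictionary mapping target vocabulary to indices.
  • demo-train.t7: Serialized Torch file containing the vocabularies, training data, and validation data.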

The *.dict files are needed to check or reuse the vocabularies. These files are simple human-readable dictionaries.

$ head -n 11 data/demo.src.dict
<blank> 1
<unk> 2
<s> 3
</s> 4
It 5
is 6
not 7
acceptable 8
that 9
, 10
with 11
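
Since each line of a *.dict file is a token followed by its index, the vocabulary is easy to load for inspection. The sketch below assumes the space-separated layout shown in the listing above; `load_dict` is a hypothetical helper, not an OpenNMT function.

```python
def load_dict(lines):
    """Map each token to its integer index from '<token> <index>' lines."""
    vocab = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last space so the index is always the final field.
        token, index = line.rsplit(" ", 1)
        vocab[token] = int(index)
    return vocab


# First entries copied from the listing above.
sample = ["<blank> 1", "<unk> 2", "<s> 3", "</s> 4", "It 5"]
vocab = load_dict(sample)
print(vocab["<unk>"])  # 2
print(len(vocab))      # 5
```

To load the real file, pass an open handle, for example `load_dict(open("data/demo.src.dict", encoding="utf-8"))`.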



If the corpus is not tokenized, you can use OpenNMT's tokenizer.

Step 2: Train the model

th train.lua -data data/demo-train.t7 -save_model demo-model

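The main training command is very simple. At minimum, it takes a data file and a save file. This runs the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and the decoder. To use GPU 1, add -gpuid 1.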
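
If you launch training from a script, the command can be assembled programmatically. This is a minimal sketch that uses only the flags shown in this step (-data, -save_model, -gpuid); `train_command` is a hypothetical wrapper, and actually running the result assumes Torch and OpenNMT are installed.

```python
import shlex


def train_command(data, save_model, gpuid=None):
    """Build the argument list for th train.lua using only the flags shown above."""
    cmd = ["th", "train.lua", "-data", data, "-save_model", save_model]
    if gpuid is not None:
        cmd += ["-gpuid", str(gpuid)]
    return cmd


cmd = train_command("data/demo-train.t7", "demo-model", gpuid=1)
print(shlex.join(cmd))
# To launch training for real (requires Torch and OpenNMT):
# import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.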

Step 3: Translate

th translate.lua -model demo-model_epochX_PPL.t7 -src data/src-test.txt -output pred.txt

You now have a model that you can use to translate new data. Translation is done with beam search. This will output predictions into pred.txt.
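
To give a feel for what beam search does, here is a toy sketch in Python. It is not OpenNMT's implementation: `step_fn` stands in for the trained decoder, the `TOY` probability table is made up, and the stopping rule is deliberately simplified.

```python
import math


def beam_search(step_fn, start, end, beam_size=3, max_len=10):
    """Return the highest-scoring token sequence found with a fixed-width beam.

    step_fn(prefix) must return {next_token: probability}; it is a
    hypothetical stand-in for the decoder.
    """
    beams = [(0.0, [start])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, p in step_fn(seq).items():
                if p > 0:
                    candidates.append((score + math.log(p), seq + [tok]))
        if not candidates:
            break
        # Keep only the beam_size best expansions at this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            if cand[1][-1] == end:
                finished.append(cand)
            else:
                beams.append(cand)
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[0])[1]


# Made-up next-token probabilities, keyed by the previous token only.
TOY = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.5, "b": 0.5},
    "b": {"</s>": 0.9, "a": 0.1},
}


def toy_step(prefix):
    return TOY[prefix[-1]]


print(beam_search(toy_step, "<s>", "</s>", beam_size=2))  # ['<s>', 'b', '</s>']
```

With beam_size=1 the search is greedy and commits to "a" first (probability 0.6), ending with total probability 0.3. A beam of 2 keeps "b" alive and finds the better sequence with probability 0.4 × 0.9 = 0.36.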


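Because the demo dataset is small, the translation results will not be good. Try a larger dataset. For example, datasets with millions of sentence pairs for translation or summarization can be downloaded from the internet.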