Step 1: Preprocess the data

python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

We will be working with some example data in the data/ folder.

The data consists of parallel source (src) and target (tgt) data containing one sentence per line with tokens separated by a space:

  • src-train.txt
  • tgt-train.txt
  • src-val.txt
  • tgt-val.txt

Validation files are required and are used to evaluate the convergence of the training. They usually contain no more than 5,000 sentences.

$ head -n 3 data/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament 's legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
" Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .
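Since the source and target files must stay line-aligned, a quick sanity check before preprocessing can save a confusing error later. A minimal sketch (the `check_parallel` helper is hypothetical, not part of OpenNMT; the demo below uses tiny temporary files instead of the real data/ files):

```python
import os
import tempfile

def check_parallel(src_path, tgt_path):
    """Verify that two parallel files have the same number of lines.

    Each line is expected to hold one sentence with space-separated tokens.
    Returns the number of sentence pairs.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        src_lines = src.read().splitlines()
        tgt_lines = tgt.read().splitlines()
    assert len(src_lines) == len(tgt_lines), "files are not line-aligned"
    return len(src_lines)

# Demonstrate on two tiny in-memory files (2 sentence pairs each):
with tempfile.TemporaryDirectory() as d:
    for name, text in [("src.txt", "a b c\nd e\n"), ("tgt.txt", "x y\nz\n")]:
        with open(os.path.join(d, name), "w", encoding="utf-8") as f:
            f.write(text)
    n = check_parallel(os.path.join(d, "src.txt"),
                       os.path.join(d, "tgt.txt"))
```

On the real data you would point it at data/src-train.txt and data/tgt-train.txt, and likewise for the validation pair.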

Step 2: Train the model

python train.py -data data/demo -save_model demo-model

The main train command is quite simple. Minimally it takes a data file and a save file. This will run the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and decoder. If you want to train on GPU, you need to set, as an example: CUDA_VISIBLE_DEVICES=1,3 -world_size 2 -gpu_ranks 0 1 to use (say) GPUs 1 and 3 on this node only. To learn more about distributed training on single or multiple nodes, read the FAQ section.
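The interplay of CUDA_VISIBLE_DEVICES and -gpu_ranks can be confusing: the environment variable selects which physical GPUs are visible, and the ranks then index into that visible list. A small sketch of the mapping (illustrative only; this is how CUDA device masking works in general, not OpenNMT code):

```python
import os

# Make only physical GPUs 1 and 3 visible to the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,3"

# Inside the process, devices are renumbered 0..N-1 over the visible list,
# so rank 0 is physical GPU 1 and rank 1 is physical GPU 3.
visible = [int(g) for g in os.environ["CUDA_VISIBLE_DEVICES"].split(",")]
gpu_ranks = [0, 1]  # as passed via -gpu_ranks 0 1
mapping = {rank: visible[rank] for rank in gpu_ranks}
```

Here `mapping` shows which physical device each training rank ends up on.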

Step 3: Translate

python translate.py -model demo-model_XYZ.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose

Now you have a model which you can use to translate new data. We do this by running beam search. This will output predictions into pred.txt.
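To give an intuition for what beam search does during decoding, here is a toy sketch: at each step it expands every partial hypothesis with every candidate token, then keeps only the top-k by cumulative log-probability. This is a simplified illustration (per-step probabilities independent of history), not OpenNMT's decoder:

```python
import math

def beam_search(step_probs, beam_size=2):
    """Toy beam search over a fixed table of per-step token probabilities.

    step_probs[t][token] is the probability of `token` at step t.
    Returns the best sequence and its cumulative log-probability.
    """
    beams = [([], 0.0)]  # list of (token sequence, log-prob score)
    for probs in step_probs:
        candidates = []
        for seq, score in beams:
            for token, p in probs.items():
                candidates.append((seq + [token], score + math.log(p)))
        # Prune: keep only the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

steps = [{"the": 0.6, "a": 0.4}, {"cat": 0.7, "dog": 0.3}]
best, score = beam_search(steps, beam_size=2)
```

With beam_size=1 this degenerates to greedy decoding; a wider beam explores more hypotheses at higher cost.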


Because the demo dataset is small, the predictions will not be very good. Try running on a larger dataset. For example, millions of parallel sentence pairs for translation or summarization can be downloaded from the internet.