Train

train.py

usage: train.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG]
                [--src_word_vec_size SRC_WORD_VEC_SIZE]
                [--tgt_word_vec_size TGT_WORD_VEC_SIZE]
                [--word_vec_size WORD_VEC_SIZE] [--share_decoder_embeddings]
                [--share_embeddings] [--position_encoding]
                [--feat_merge {concat,sum,mlp}]
                [--feat_vec_size FEAT_VEC_SIZE]
                [--feat_vec_exponent FEAT_VEC_EXPONENT]
                [--model_type {text,img,audio}] [--model_dtype {fp32,fp16}]
                [--encoder_type {rnn,brnn,mean,transformer,cnn}]
                [--decoder_type {rnn,transformer,cnn}] [--layers LAYERS]
                [--enc_layers ENC_LAYERS] [--dec_layers DEC_LAYERS]
                [--rnn_size RNN_SIZE] [--enc_rnn_size ENC_RNN_SIZE]
                [--dec_rnn_size DEC_RNN_SIZE]
                [--audio_enc_pooling AUDIO_ENC_POOLING]
                [--cnn_kernel_width CNN_KERNEL_WIDTH]
                [--input_feed INPUT_FEED] [--bridge]
                [--rnn_type {LSTM,GRU,SRU}] [--brnn]
                [--context_gate {source,target,both}]
                [--global_attention {dot,general,mlp,none}]
                [--global_attention_function {softmax,sparsemax}]
                [--self_attn_type SELF_ATTN_TYPE]
                [--max_relative_positions MAX_RELATIVE_POSITIONS]
                [--heads HEADS] [--transformer_ff TRANSFORMER_FF]
                [--copy_attn] [--copy_attn_type {dot,general,mlp,none}]
                [--generator_function {softmax,sparsemax}] [--copy_attn_force]
                [--reuse_copy_attn] [--copy_loss_by_seqlength]
                [--coverage_attn] [--lambda_coverage LAMBDA_COVERAGE]
                [--loss_scale LOSS_SCALE] --data DATA
                [--save_model SAVE_MODEL]
                [--save_checkpoint_steps SAVE_CHECKPOINT_STEPS]
                [--keep_checkpoint KEEP_CHECKPOINT]
                [--gpuid [GPUID [GPUID ...]]]
                [--gpu_ranks [GPU_RANKS [GPU_RANKS ...]]]
                [--world_size WORLD_SIZE] [--gpu_backend GPU_BACKEND]
                [--gpu_verbose_level GPU_VERBOSE_LEVEL]
                [--master_ip MASTER_IP] [--master_port MASTER_PORT]
                [--seed SEED] [--param_init PARAM_INIT] [--param_init_glorot]
                [--train_from TRAIN_FROM]
                [--reset_optim {none,all,states,keep_states}]
                [--pre_word_vecs_enc PRE_WORD_VECS_ENC]
                [--pre_word_vecs_dec PRE_WORD_VECS_DEC] [--fix_word_vecs_enc]
                [--fix_word_vecs_dec] [--batch_size BATCH_SIZE]
                [--batch_type {sents,tokens}] [--normalization {sents,tokens}]
                [--accum_count ACCUM_COUNT] [--valid_steps VALID_STEPS]
                [--valid_batch_size VALID_BATCH_SIZE]
                [--max_generator_batches MAX_GENERATOR_BATCHES]
                [--train_steps TRAIN_STEPS] [--single_pass] [--epochs EPOCHS]
                [--optim {sgd,adagrad,adadelta,adam,sparseadam,adafactor,fusedadam}]
                [--adagrad_accumulator_init ADAGRAD_ACCUMULATOR_INIT]
                [--max_grad_norm MAX_GRAD_NORM] [--dropout DROPOUT]
                [--truncated_decoder TRUNCATED_DECODER]
                [--adam_beta1 ADAM_BETA1] [--adam_beta2 ADAM_BETA2]
                [--label_smoothing LABEL_SMOOTHING]
                [--average_decay AVERAGE_DECAY]
                [--average_every AVERAGE_EVERY]
                [--learning_rate LEARNING_RATE]
                [--learning_rate_decay LEARNING_RATE_DECAY]
                [--start_decay_steps START_DECAY_STEPS]
                [--decay_steps DECAY_STEPS] [--decay_method {noam,rsqrt,none}]
                [--warmup_steps WARMUP_STEPS] [--report_every REPORT_EVERY]
                [--log_file LOG_FILE]
                [--log_file_level {DEBUG,CRITICAL,WARNING,NOTSET,INFO,ERROR,10,50,30,0,20,40}]
                [--exp_host EXP_HOST] [--exp EXP] [--tensorboard]
                [--tensorboard_log_dir TENSORBOARD_LOG_DIR]
                [--sample_rate SAMPLE_RATE] [--window_size WINDOW_SIZE]
                [--image_channel_size {3,1}]

Named Arguments

-config, --config
 config file path
-save_config, --save_config
 config file save path

Model-Embeddings

--src_word_vec_size, -src_word_vec_size
 

Word embedding size for src.

Default: 500

--tgt_word_vec_size, -tgt_word_vec_size
 

Word embedding size for tgt.

Default: 500

--word_vec_size, -word_vec_size
 

Word embedding size for src and tgt.

Default: -1

--share_decoder_embeddings, -share_decoder_embeddings
 

Use a shared weight matrix for the input and output word embeddings in the decoder.

Default: False

--share_embeddings, -share_embeddings
 

Share the word embeddings between encoder and decoder. This option requires a shared dictionary.

Default: False

--position_encoding, -position_encoding
 

Use a sinusoid to mark relative word positions. Necessary for non-RNN style models.

Default: False
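
As a rough illustration of what a sinusoidal position signal looks like, the sketch below builds the table described in the "Attention is All You Need" paper. The function name and the interleaved sin/cos layout are assumptions made for this example; the layout actually used by train.py may differ.

```python
import math


def sinusoidal_position_encoding(max_len, dim):
    """Return a max_len x dim table of sinusoidal position encodings.

    Hypothetical helper: even columns hold sin(pos / 10000^(i/dim)),
    odd columns hold the matching cos.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_position_encoding(max_len=4, dim=6)
```

Each row would be added to the word embedding at that position, which gives a model without recurrence a way to tell positions apart.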

Model-Embedding Features

--feat_merge, -feat_merge
 

Possible choices: concat, sum, mlp

Merge action for incorporating feature embeddings. Options [concat|sum|mlp].

Default: “concat”

--feat_vec_size, -feat_vec_size
 

If specified, feature embedding sizes will be set to this. Otherwise, feat_vec_exponent will be used.

Default: -1

--feat_vec_exponent, -feat_vec_exponent
 

If -feat_vec_size is not set, feature embedding sizes will be set to N^feat_vec_exponent, where N is the number of values the feature takes.

Default: 0.7
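
The interaction between feat_vec_size and feat_vec_exponent can be sketched as follows. The function name is hypothetical, and truncating the result to an integer is an assumption; the text above only states the N^feat_vec_exponent rule.

```python
def feature_embedding_size(num_values, feat_vec_size=-1, feat_vec_exponent=0.7):
    """Pick an embedding size for one feature (hypothetical helper).

    num_values is N, the number of distinct values the feature takes.
    """
    if feat_vec_size > 0:
        # An explicit size takes precedence over the exponent rule.
        return feat_vec_size
    # Assumption: the fractional part is dropped.
    return int(num_values ** feat_vec_exponent)


size_from_exponent = feature_embedding_size(100)
size_explicit = feature_embedding_size(100, feat_vec_size=16)
```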

Model-Encoder-Decoder

--model_type, -model_type
 

Possible choices: text, img, audio

Type of source model to use. Allows the system to incorporate non-text inputs. Options are [text|img|audio].

Default: “text”

--model_dtype, -model_dtype
 

Possible choices: fp32, fp16

Data type of the model.

Default: “fp32”

--encoder_type, -encoder_type
 

Possible choices: rnn, brnn, mean, transformer, cnn

Type of encoder layer to use. Non-RNN layers are experimental. Options are [rnn|brnn|mean|transformer|cnn].

Default: “rnn”

--decoder_type, -decoder_type
 

Possible choices: rnn, transformer, cnn

Type of decoder layer to use. Non-RNN layers are experimental. Options are [rnn|transformer|cnn].

Default: “rnn”

--layers, -layers
 

Number of layers in enc/dec.

Default: -1

--enc_layers, -enc_layers
 

Number of layers in the encoder

Default: 2

--dec_layers, -dec_layers
 

Number of layers in the decoder

Default: 2

--rnn_size, -rnn_size
 

Size of rnn hidden states. Overrides enc_rnn_size and dec_rnn_size.

Default: -1

--enc_rnn_size, -enc_rnn_size
 

Size of encoder rnn hidden states. Must be equal to dec_rnn_size except for speech-to-text.

Default: 500

--dec_rnn_size, -dec_rnn_size
 

Size of decoder rnn hidden states. Must be equal to enc_rnn_size except for speech-to-text.

Default: 500
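
A minimal sketch of how the three size options might be resolved, assuming that a positive rnn_size replaces both per-side values and that "speech-to-text" corresponds to model_type audio. The function name and the error handling are illustrative, not taken from train.py.

```python
def resolve_rnn_sizes(rnn_size=-1, enc_rnn_size=500, dec_rnn_size=500,
                      model_type="text"):
    """Return (encoder_size, decoder_size) after applying rnn_size."""
    if rnn_size > 0:
        # rnn_size overrides both per-side settings.
        enc_rnn_size = dec_rnn_size = rnn_size
    # Assumption: only the audio model type may use different sizes.
    if enc_rnn_size != dec_rnn_size and model_type != "audio":
        raise ValueError("enc_rnn_size must equal dec_rnn_size")
    return enc_rnn_size, dec_rnn_size


default_sizes = resolve_rnn_sizes()
overridden_sizes = resolve_rnn_sizes(rnn_size=256)
```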

--audio_enc_pooling, -audio_enc_pooling
 

The amount of pooling in the audio encoder: either a single number, which applies the same amount of pooling to all layers, or a comma-separated list giving a different amount of pooling for each layer.

Default: “1”
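
One way to read the two accepted forms of this value is sketched below. The function name and the check against the number of encoder layers are assumptions made for the example.

```python
def parse_audio_enc_pooling(audio_enc_pooling, enc_layers):
    """Expand the option string into one pooling amount per encoder layer."""
    values = [int(part) for part in audio_enc_pooling.split(",")]
    if len(values) == 1:
        # A single number is repeated for every layer.
        values = values * enc_layers
    if len(values) != enc_layers:
        raise ValueError("expected 1 or %d pooling values" % enc_layers)
    return values


same_everywhere = parse_audio_enc_pooling("2", enc_layers=3)
per_layer = parse_audio_enc_pooling("1,2,2", enc_layers=3)
```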

--cnn_kernel_width, -cnn_kernel_width
 

Size of windows in the cnn. The kernel_size of the conv layer is (cnn_kernel_width, 1).

Default: 3

--input_feed, -input_feed
 

Feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.

Default: 1

--bridge, -bridge
 

Have an additional layer between the last encoder state and the first decoder state

Default: False

--rnn_type, -rnn_type
 

Possible choices: LSTM, GRU, SRU

The gate type to use in the RNNs

Default: “LSTM”

--brnn, -brnn

Deprecated; use encoder_type instead.

--context_gate, -context_gate
 

Possible choices: source, target, both

Type of context gate to use. Omit this option to use no context gate.

Model-Attention

--global_attention, -global_attention
 

Possible choices: dot, general, mlp, none

The attention type to use: dot product or general (Luong), or MLP (Bahdanau).

Default: “general”

--global_attention_function, -global_attention_function
 

Possible choices: softmax, sparsemax

Default: “softmax”

--self_attn_type, -self_attn_type
 

Self attention type in Transformer decoder layer – currently “scaled-dot” or “average”

Default: “scaled-dot”

--max_relative_positions, -max_relative_positions
 

Maximum distance between inputs in relative positions representations. For more detailed information, see: https://arxiv.org/pdf/1803.02155.pdf

Default: 0

--heads, -heads
 

Number of heads for transformer self-attention

Default: 8

--transformer_ff, -transformer_ff
 

Size of hidden transformer feed-forward

Default: 2048

--copy_attn, -copy_attn
 

Train copy attention layer.

Default: False

--copy_attn_type, -copy_attn_type
 

Possible choices: dot, general, mlp, none

The copy attention type to use. Leave as None to use the same as -global_attention.

--generator_function, -generator_function
 

Possible choices: softmax, sparsemax

Which function to use for generating probabilities over the target vocabulary (choices: softmax, sparsemax)

Default: “softmax”

--copy_attn_force, -copy_attn_force
 

When available, train to copy.

Default: False

--reuse_copy_attn, -reuse_copy_attn
 

Reuse standard attention for copy

Default: False

--copy_loss_by_seqlength, -copy_loss_by_seqlength
 

Divide copy loss by length of sequence

Default: False

--coverage_attn, -coverage_attn
 

Train a coverage attention layer.

Default: False

--lambda_coverage, -lambda_coverage
 

Lambda value for coverage.

Default: 1

--loss_scale, -loss_scale
 

For FP16 training, the static loss scale to use. If not set, the loss scale is dynamically computed.

Default: 0

General

--data, -data

Path prefix to the “.train.pt” and “.valid.pt” files produced by preprocess.py. Required.

--save_model, -save_model
 

Model filename (the model will be saved as <save_model>_N.pt, where N is the number of steps).

Default: “model”

--save_checkpoint_steps, -save_checkpoint_steps
 

Save a checkpoint every X steps

Default: 5000

--keep_checkpoint, -keep_checkpoint
 

Keep X checkpoints (negative: keep all)

Default: -1
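
The three checkpoint options above can be read together as in the following simulation. It assumes that file names are the save_model prefix followed by _N.pt, and that the oldest checkpoints are the ones discarded; the function name is hypothetical.

```python
from collections import deque


def simulate_checkpoints(save_model, train_steps, save_checkpoint_steps,
                         keep_checkpoint):
    """Return the checkpoint file names left on disk after training."""
    # A bounded deque drops its oldest entry once full.
    # A negative keep_checkpoint means "keep all" (maxlen=None).
    maxlen = None if keep_checkpoint < 0 else keep_checkpoint
    kept = deque(maxlen=maxlen)
    for step in range(1, train_steps + 1):
        if step % save_checkpoint_steps == 0:
            kept.append(f"{save_model}_{step}.pt")
    return list(kept)


last_two = simulate_checkpoints("model", 20000, 5000, keep_checkpoint=2)
everything = simulate_checkpoints("model", 20000, 5000, keep_checkpoint=-1)
```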

--gpuid, -gpuid
 

Deprecated; see world_size and gpu_ranks.

Default: []

--gpu_ranks, -gpu_ranks
 

List of ranks of each process.

Default: []

--world_size, -world_size
 

Total number of distributed processes.

Default: 1

--gpu_backend, -gpu_backend
 

Type of torch distributed backend

Default: “nccl”

--gpu_verbose_level, -gpu_verbose_level
 

Gives more info on each process per GPU.

Default: 0

--master_ip, -master_ip
 

IP of master for torch.distributed training.

Default: “localhost”

--master_port, -master_port
 

Port of master for torch.distributed training.

Default: 10000

--seed, -seed

Random seed used for reproducibility of experiments.

Default: -1

Initialization

--param_init, -param_init
 

Parameters are initialized from a uniform distribution with support (-param_init, param_init). Use 0 to skip this initialization.

Default: 0.1
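
A small sketch of the uniform initialization described above, using only the standard library. The function name, the seed argument, and returning None when param_init is 0 are choices made for this example.

```python
import random


def init_uniform(num_params, param_init=0.1, seed=None):
    """Draw num_params values from a uniform distribution on
    [-param_init, param_init]; return None when initialization is disabled.
    """
    if param_init == 0:
        # Hypothetical convention: the caller keeps its existing values.
        return None
    rng = random.Random(seed)
    return [rng.uniform(-param_init, param_init) for _ in range(num_params)]


weights = init_uniform(1000, param_init=0.1, seed=1234)
```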

--param_init_glorot, -param_init_glorot
 

Init parameters with xavier_uniform. Required for transformer.

Default: False

--train_from, -train_from
 

If training from a checkpoint then this is the path to the pretrained model’s state_dict.

Default: “”

--reset_optim, -reset_optim
 

Possible choices: none, all, states, keep_states

Optimization resetter to apply when train_from is used.

Default: “none”

--pre_word_vecs_enc, -pre_word_vecs_enc
 If a valid path is specified, then this will load pretrained word embeddings on the encoder side. See README for specific formatting instructions.
--pre_word_vecs_dec, -pre_word_vecs_dec
 If a valid path is specified, then this will load pretrained word embeddings on the decoder side. See README for specific formatting instructions.
--fix_word_vecs_enc, -fix_word_vecs_enc
 

Fix word embeddings on the encoder side.

Default: False

--fix_word_vecs_dec, -fix_word_vecs_dec
 

Fix word embeddings on the decoder side.

Default: False

Optimization-Type

--batch_size, -batch_size
 

Maximum batch size for training

Default: 64

--batch_type, -batch_type
 

Possible choices: sents, tokens

Batch grouping for batch_size. The standard is sents. With tokens, dynamic batching is used.

Default: “sents”
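
The difference between the two batch types can be illustrated with a greedy grouping over sentence lengths. This is a simplified sketch: it counts raw tokens and ignores padding, which a real dynamic batcher may well account for, and the function name is hypothetical.

```python
def make_batches(sentence_lengths, batch_size, batch_type="sents"):
    """Group sentences (given by their lengths) into batches.

    With "sents" each sentence costs 1; with "tokens" it costs its length.
    A batch is closed as soon as adding a sentence would exceed batch_size.
    """
    batches, current, current_cost = [], [], 0
    for length in sentence_lengths:
        cost = 1 if batch_type == "sents" else length
        if current and current_cost + cost > batch_size:
            batches.append(current)
            current, current_cost = [], 0
        current.append(length)
        current_cost += cost
    if current:
        batches.append(current)
    return batches


lengths = [5, 7, 3, 9, 2]
by_sentences = make_batches(lengths, batch_size=2, batch_type="sents")
by_tokens = make_batches(lengths, batch_size=10, batch_type="tokens")
```

With sents every batch holds the same number of sentences, while with tokens the number of sentences per batch varies with their lengths.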

--normalization, -normalization
 

Possible choices: sents, tokens

Normalization method of the gradient.

Default: “sents”

--accum_count, -accum_count
 

Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for Transformer.

Default: 1
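
Gradient accumulation can be sketched on a single scalar weight as follows. The plain SGD update and the function name are assumptions for illustration; gradients left over at the end (fewer than accum_count) are simply not applied here.

```python
def accumulate_and_step(weight, batch_grads, accum_count, learning_rate):
    """Apply one SGD update for every accum_count gradients in batch_grads."""
    accumulated = 0.0
    for i, grad in enumerate(batch_grads, start=1):
        accumulated += grad
        if i % accum_count == 0:
            # One parameter update using the summed gradient.
            weight -= learning_rate * accumulated
            accumulated = 0.0
    return weight


final_weight = accumulate_and_step(1.0, [0.5, 0.5, 1.0, 1.0],
                                   accum_count=2, learning_rate=0.1)
```

Four batches with accum_count=2 produce two updates, each based on the sum of two gradients.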

--valid_steps, -valid_steps
 

Perform validation every X steps

Default: 10000

--valid_batch_size, -valid_batch_size
 

Maximum batch size for validation

Default: 32

--max_generator_batches, -max_generator_batches
 

Maximum batches of words in a sequence to run the generator on in parallel. Higher is faster, but uses more memory. Set to 0 to disable.

Default: 32

--train_steps, -train_steps
 

Number of training steps

Default: 100000

--single_pass, -single_pass
 

Make a single pass over the training dataset.

Default: False

--epochs, -epochs
 

Deprecated; see train_steps.

Default: 0

--optim, -optim
 

Possible choices: sgd, adagrad, adadelta, adam, sparseadam, adafactor, fusedadam

Optimization method.

Default: “sgd”

--adagrad_accumulator_init, -adagrad_accumulator_init
 

Initializes the accumulator values in adagrad. Mirrors the initial_accumulator_value option in the tensorflow adagrad (use 0.1 for their default).

Default: 0

--max_grad_norm, -max_grad_norm
 

If the norm of the gradient vector exceeds this, renormalize it to have the norm equal to max_grad_norm

Default: 5
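
The renormalization described above amounts to rescaling the gradient vector when its L2 norm is too large. A minimal sketch on a flat list of gradient values, with a hypothetical function name:

```python
import math


def clip_by_norm(grads, max_grad_norm):
    """Rescale grads so that their L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if max_grad_norm > 0 and norm > max_grad_norm:
        scale = max_grad_norm / norm
        return [g * scale for g in grads]
    # Norm already within bounds: leave the gradient unchanged.
    return list(grads)


clipped = clip_by_norm([3.0, 4.0], max_grad_norm=2.5)
untouched = clip_by_norm([3.0, 4.0], max_grad_norm=5)
```

The direction of the gradient is preserved; only its length changes.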

--dropout, -dropout
 

Dropout probability; applied in LSTM stacks.

Default: 0.3

--truncated_decoder, -truncated_decoder
 

Truncated BPTT (backpropagation through time).

Default: 0

--adam_beta1, -adam_beta1
 

The beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.

Default: 0.9

--adam_beta2, -adam_beta2
 

The beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and it is also the value adopted in other frameworks such as TensorFlow and Keras; see https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer or https://keras.io/optimizers/ . Although the paper “Attention is All You Need” more recently suggested a value of 0.98 for beta2, that value may not work well for normal models / default baselines.

Default: 0.999

--label_smoothing, -label_smoothing
 

Label smoothing value epsilon. Probabilities of all non-true labels will be smoothed by epsilon / (vocab_size - 1). Set to zero to turn off label smoothing. For more detailed information, see: https://arxiv.org/abs/1512.00567

Default: 0.0
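
The smoothed target distribution can be written out directly. The sketch assumes that the true label receives the remaining mass 1 - epsilon, which is the usual formulation; the function name is hypothetical.

```python
def smoothed_target(true_index, vocab_size, label_smoothing):
    """Return the target distribution over the vocabulary for one token."""
    # Every non-true label gets epsilon / (vocab_size - 1).
    off_value = label_smoothing / (vocab_size - 1)
    dist = [off_value] * vocab_size
    # Assumption: the true label keeps the remaining probability mass.
    dist[true_index] = 1.0 - label_smoothing
    return dist


target = smoothed_target(true_index=2, vocab_size=5, label_smoothing=0.1)
```

With label_smoothing set to 0 this reduces to the ordinary one-hot target.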

--average_decay, -average_decay
 

Moving average decay. Set to a nonzero value (e.g. 1e-4) to activate. Similar to the Marian NMT implementation: http://www.aclweb.org/anthology/P18-4020 . For more detail on exponential moving averages, see: https://en.wikipedia.org/wiki/Moving_average

Default: 0

--average_every, -average_every
 

Step for moving average. Default is every update, if -average_decay is set.

Default: 1
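
A minimal sketch of an exponential moving average over parameter snapshots, assuming that average_decay is the weight given to the newest parameters and that the average starts from the first snapshot. Both points, and the function name, are assumptions; the exact update rule is not given in the text above.

```python
def track_moving_average(param_history, average_decay, average_every=1):
    """Return the moving average after seeing all snapshots.

    param_history[0] is the initial parameter vector; each later entry is
    the parameter vector after one more update.
    """
    averages = list(param_history[0])
    for step, params in enumerate(param_history[1:], start=1):
        if step % average_every == 0:
            averages = [
                (1.0 - average_decay) * avg + average_decay * p
                for avg, p in zip(averages, params)
            ]
    return averages


history = [[0.0], [1.0], [1.0]]
every_step = track_moving_average(history, average_decay=0.5)
every_other = track_moving_average(history, average_decay=0.5, average_every=2)
```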

Optimization-Rate

--learning_rate, -learning_rate
 

Starting learning rate. Recommended settings: sgd = 1, adagrad = 0.1, adadelta = 1, adam = 0.001

Default: 1.0

--learning_rate_decay, -learning_rate_decay
 

If update_learning_rate is set, multiply the learning rate by this factor at each decay once the step count has gone past start_decay_steps.

Default: 0.5

--start_decay_steps, -start_decay_steps
 

Start decaying the learning rate every decay_steps steps once start_decay_steps steps have passed.

Default: 50000

--decay_steps, -decay_steps
 

Decay every decay_steps

Default: 10000
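
Taken together, learning_rate_decay, start_decay_steps and decay_steps describe a step-wise schedule. The sketch below is one plausible reading: the rate is constant before start_decay_steps, is decayed once at that step, and again every decay_steps steps afterwards. The exact boundary behavior is an assumption, as is the function name.

```python
def decayed_learning_rate(step, learning_rate=1.0, learning_rate_decay=0.5,
                          start_decay_steps=50000, decay_steps=10000):
    """Learning rate in effect at a given training step."""
    if step < start_decay_steps:
        return learning_rate
    # Assumption: the first decay happens at start_decay_steps itself.
    num_decays = (step - start_decay_steps) // decay_steps + 1
    return learning_rate * learning_rate_decay ** num_decays


schedule = [decayed_learning_rate(s) for s in (49999, 50000, 59999, 60000)]
```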

--decay_method, -decay_method
 

Possible choices: noam, rsqrt, none

Use a custom decay rate.

Default: “none”

--warmup_steps, -warmup_steps
 

Number of warmup steps for custom decay.

Default: 4000
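
For reference, the noam schedule from the "Attention is All You Need" paper raises the rate linearly during warmup and then decays it with the inverse square root of the step. The sketch below follows the formula from the paper; model_size is a hypothetical parameter standing for the model's hidden size, and how train.py combines it with learning_rate is an assumption.

```python
def noam_learning_rate(step, learning_rate, model_size, warmup_steps=4000):
    """Learning rate at a given step under the noam schedule."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * model_size ** -0.5 * scale


peak = noam_learning_rate(4000, learning_rate=2.0, model_size=512)
during_warmup = noam_learning_rate(100, learning_rate=2.0, model_size=512)
after_warmup = noam_learning_rate(16000, learning_rate=2.0, model_size=512)
```

The rate peaks at step warmup_steps; quadrupling the step count after that point halves the rate.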

Logging

--report_every, -report_every
 

Print stats at this interval.

Default: 50

--log_file, -log_file
 

Output logs to a file under this path.

Default: “”

--log_file_level, -log_file_level
 

Possible choices: DEBUG, CRITICAL, WARNING, NOTSET, INFO, ERROR, 10, 50, 30, 0, 20, 40

Default: “0”

--exp_host, -exp_host
 

Send logs to this crayon server.

Default: “”

--exp, -exp

Name of the experiment for logging.

Default: “”

--tensorboard, -tensorboard
 

Use tensorboardX for visualization during training. Must have the library tensorboardX.

Default: False

--tensorboard_log_dir, -tensorboard_log_dir
 

Log directory for Tensorboard. This is also the name of the run.

Default: “runs/onmt”

Speech

--sample_rate, -sample_rate
 

Sample rate.

Default: 16000

--window_size, -window_size
 

Window size for spectrogram in seconds.

Default: 0.02
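
Because window_size is given in seconds, the number of audio samples per spectrogram window follows from the sample rate. A short worked example, with a hypothetical function name and rounding to the nearest sample as an assumption:

```python
def window_length_in_samples(sample_rate=16000, window_size=0.02):
    """Number of samples covered by one spectrogram window."""
    return int(round(sample_rate * window_size))


default_window = window_length_in_samples()
```

With the defaults, 16000 samples per second times 0.02 seconds gives 320 samples per window.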

--image_channel_size, -image_channel_size
 

Possible choices: 3, 1

Number of image channels. Using grayscale images (1 channel) can make the model smaller and faster to train.

Default: 3