Doc: Modules

Core Modules

class onmt.modules.Embeddings(word_vec_size, word_vocab_size, word_padding_idx, position_encoding=False, feat_merge='concat', feat_vec_exponent=0.7, feat_vec_size=-1, feat_padding_idx=[], feat_vocab_sizes=[], dropout=0, sparse=False)[source]

Word embeddings for encoder/decoder.

Additionally includes ability to add sparse input features based on “Linguistic Input Features Improve Neural Machine Translation” [SH16].

(Diagram: the input feeds a word lookup and feature lookups 1…N; their outputs are merged by an MLP or concatenation into the output embedding.)
Parameters:
  • word_vec_size (int) – size of the word embeddings.
  • word_padding_idx (int) – padding index for words in the embeddings.
  • feat_padding_idx (list of int) – padding indices for the feature embeddings.
  • word_vocab_size (int) – size of the word embedding dictionary.
  • feat_vocab_sizes ([int], optional) – sizes of the embedding dictionary for each feature.
  • position_encoding (bool) – see onmt.modules.PositionalEncoding
  • feat_merge (string) – merge action for the feature embeddings: concat, sum or mlp.
  • feat_vec_exponent (float) – when using -feat_merge concat, the feature embedding size is N^feat_vec_exponent, where N is the number of values the feature takes.
  • feat_vec_size (int) – embedding dimension for features when using -feat_merge mlp.
  • dropout (float) – dropout probability.
  • sparse (bool) – use sparse gradient updates for the embedding lookup.
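For -feat_merge concat with the default feat_vec_size, the feature embedding size follows the N^feat_vec_exponent rule described above. A minimal sketch of that sizing rule (the helper name is illustrative, not part of the API):

```python
def feat_embedding_dim(feat_vocab_size, feat_vec_exponent=0.7):
    """Feature embedding size N**exponent (rounded), where N is the
    number of distinct values the feature takes."""
    return int(round(feat_vocab_size ** feat_vec_exponent))

# e.g. a POS-tag feature with 45 distinct tags: 45**0.7 ≈ 14.4 → 14
dim = feat_embedding_dim(45)
```

The exponent keeps feature embeddings small relative to their vocabulary, so frequent low-cardinality features (case, POS) do not dominate the concatenated vector.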

embedding look-up table

forward(source, step=None)[source]

Computes the embeddings for words and features.

Parameters:source (LongTensor) – index tensor [len x batch x nfeat]
Returns:word embeddings [len x batch x embedding_size]
Return type:FloatTensor
load_pretrained_vectors(emb_file, fixed)[source]

Load in pretrained embeddings.

  • emb_file (str) – path to torch serialized embeddings
  • fixed (bool) – if true, embeddings are not updated

word look-up table




class onmt.modules.GlobalAttention(dim, coverage=False, attn_type='dot', attn_func='softmax')[source]

Global attention takes a matrix and a query vector. It then computes a parameterized convex combination of the matrix based on the input query.

Constructs a unit mapping a query q of size dim and a source matrix H of size n x dim, to an output of size dim.

(Diagram: the query and the RNN states H_1…H_N feed the attention unit, which weights the H_j to produce the output.)

All models compute the output as \(c = \sum_{j=1}^{SeqLength} a_j H_j\) where \(a_j\) is the softmax of a score function. A projection layer is then applied to \([q, c]\).

However, they differ in how they compute the attention score.

  • Luong Attention (dot, general):
    • dot: \(score(H_j,q) = H_j^T q\)
    • general: \(score(H_j, q) = H_j^T W_a q\)
  • Bahdanau Attention (mlp):
    • \(score(H_j, q) = v_a^T \tanh(W_a q + U_a H_j)\)
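The dot score and the convex combination above can be sketched in NumPy (a toy illustration of the math, not the library code):

```python
import numpy as np

def dot_score(H, q):
    """Luong dot score: score(H_j, q) = H_j^T q for every row H_j."""
    return H @ q                      # [n]

def attend(H, q):
    """Convex combination c = sum_j a_j H_j, a = softmax(score)."""
    s = dot_score(H, q)
    a = np.exp(s - s.max())
    a /= a.sum()                      # softmax weights, sum to 1
    return a @ H, a                   # context [dim], weights [n]

H = np.array([[1.0, 0.0], [0.0, 1.0]])   # n=2 source states
q = np.array([2.0, 0.0])                 # query
c, a = attend(H, q)                      # a puts most mass on H_1
```

Because the weights are a softmax, the context c always lies in the convex hull of the source states H_j.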
Parameters:
  • dim (int) – dimensionality of query and key
  • coverage (bool) – use coverage term
  • attn_type (str) – type of attention to use, options [dot,general,mlp]
  • attn_func (str) – attention function to use, options [softmax,sparsemax]
forward(source, memory_bank, memory_lengths=None, coverage=None)[source]
Parameters:
  • source (FloatTensor) – query vectors [batch x tgt_len x dim]
  • memory_bank (FloatTensor) – source vectors [batch x src_len x dim]
  • memory_lengths (LongTensor) – the source context lengths [batch]
  • coverage (FloatTensor) – None (not supported yet)

Returns:
  • Computed vector [tgt_len x batch x dim]
  • Attention distributions for each query [tgt_len x batch x src_len]

Return type:

(FloatTensor, FloatTensor)

score(h_t, h_s)[source]
Parameters:
  • h_t (FloatTensor) – sequence of queries [batch x tgt_len x dim]
  • h_s (FloatTensor) – sequence of sources [batch x src_len x dim]

Returns:raw attention scores (unnormalized) for each src index [batch x tgt_len x src_len]

Return type:

FloatTensor
Architecture: Transformer

class onmt.modules.PositionalEncoding(dropout, dim, max_len=5000)[source]

Implements the sinusoidal positional encoding for non-recurrent neural networks.

Implementation based on “Attention Is All You Need” [DBLP:journals/corr/VaswaniSPUJGKP17]

Parameters:
  • dropout (float) – dropout parameter
  • dim (int) – embedding size
forward(emb, step=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.


Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
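A minimal NumPy sketch of the sinusoidal encoding table described in the paper (the helper name is illustrative; the module itself adds this table to the embeddings and applies dropout):

```python
import numpy as np

def positional_encoding(max_len, dim):
    """Sinusoidal encodings:
    PE[pos, 2i]   = sin(pos / 10000**(2i/dim))
    PE[pos, 2i+1] = cos(pos / 10000**(2i/dim))"""
    pe = np.zeros((max_len, dim))
    pos = np.arange(max_len)[:, None]                           # [max_len, 1]
    div = np.exp(np.arange(0, dim, 2) * (-np.log(10000.0) / dim))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe

pe = positional_encoding(50, 8)   # position 0 row is [0, 1, 0, 1, ...]
```

Each dimension pair oscillates at a different wavelength, so positions get distinct, smoothly varying codes without any learned parameters.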

class onmt.modules.MultiHeadedAttention(head_count, model_dim, dropout=0.1)[source]

Multi-Head Attention module from “Attention is All You Need” [DBLP:journals/corr/VaswaniSPUJGKP17].

Similar to standard dot attention but uses multiple attention distributions simultaneously to select relevant items.

(Diagram: the key and query feed N parallel attention heads; the heads' outputs are combined with the values into the output.)

Also includes several additional tricks.

Parameters:
  • head_count (int) – number of parallel heads
  • model_dim (int) – the dimension of keys/values/queries, must be divisible by head_count
  • dropout (float) – dropout parameter
forward(key, value, query, mask=None, layer_cache=None, type=None)[source]

Compute the context vector and the attention vectors.

Parameters:
  • key (FloatTensor) – set of key_len key vectors [batch, key_len, dim]
  • value (FloatTensor) – set of key_len value vectors [batch, key_len, dim]
  • query (FloatTensor) – set of query_len query vectors [batch, query_len, dim]
  • mask – binary mask indicating which keys have non-zero attention [batch, query_len, key_len]

Returns:
  • output context vectors [batch, query_len, dim]
  • one of the attention vectors [batch, query_len, key_len]

Return type:

(FloatTensor, FloatTensor)
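The idea of running scaled dot-product attention over head_count parallel subspaces can be sketched as follows (a toy NumPy illustration; the real module also applies learned linear projections and the mask):

```python
import numpy as np

def multi_head_attention(key, value, query, head_count):
    """Split dim into head_count subspaces, attend in each, re-merge."""
    b, k_len, dim = key.shape
    q_len = query.shape[1]
    d_head = dim // head_count        # model_dim must divide evenly

    def split(x):                     # [b, len, dim] -> [b, heads, len, d_head]
        return x.reshape(b, -1, head_count, d_head).transpose(0, 2, 1, 3)

    K, V, Q = split(key), split(value), split(query)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over keys
    ctx = attn @ V                                    # [b, heads, q_len, d_head]
    out = ctx.transpose(0, 2, 1, 3).reshape(b, q_len, dim)
    return out, attn[:, 0]            # context, one head's attention

out, attn = multi_head_attention(
    np.random.rand(2, 5, 8), np.random.rand(2, 5, 8),
    np.random.rand(2, 3, 8), head_count=2)
```

Each head attends over a d_head-dimensional slice, so different heads can specialize on different relations before their contexts are concatenated back to model_dim.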

Architecture: Conv2Conv

(These methods are from a user contribution and have not been thoroughly tested.)

class onmt.modules.ConvMultiStepAttention(input_size)[source]

Conv attention takes a key matrix, a value matrix and a query vector. The attention weights are computed from the key matrix and the query vector, and are then used to take a weighted sum over the value matrix. The same operation is applied in each decoder conv layer.


Apply mask

forward(base_target_emb, input_from_dec, encoder_out_top, encoder_out_combine)[source]
Parameters:
  • base_target_emb – target embedding tensor
  • input_from_dec – output of the decoder convolution
  • encoder_out_top – the key matrix for computing the attention weights, i.e. the top output of the encoder convolution
  • encoder_out_combine – the value matrix for the attention-weighted sum, i.e. the combination of the base embeddings and the top encoder output

Architecture: SRU

Alternative Encoders



Copy Attention

class onmt.modules.CopyGenerator(input_size, output_size, pad_idx)[source]

An implementation of pointer-generator networks (See et al., 2017), which consider copying words directly from the source sequence.

The copy generator is an extended version of the standard generator that computes three values.

  • \(p_{softmax}\) the standard softmax over tgt_dict
  • \(p(z)\) the probability of copying a word from the source
  • \(p_{copy}\) the probability of copying a particular word, taken directly from the attention distribution.

The model returns a distribution over the extended dictionary, computed as

\(p(w) = p(z=1) p_{copy}(w) + p(z=0) p_{softmax}(w)\)
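The mixture above can be sketched numerically (toy numbers; the real module scatters copy scores into the extended vocabulary through src_map):

```python
import numpy as np

def copy_distribution(p_softmax, p_copy, p_z):
    """Mix generator and copy distributions over the extended vocab:
    p(w) = p(z=1)*p_copy(w) + p(z=0)*p_softmax(w).
    p_softmax is zero-padded over the extra (source-only) words."""
    tgt_vocab = len(p_softmax)
    extended = np.zeros(len(p_copy))
    extended[:tgt_vocab] = (1.0 - p_z) * p_softmax
    return extended + p_z * p_copy

# toy example: 3-word tgt vocab plus 1 source-only word
p_softmax = np.array([0.5, 0.3, 0.2])        # over tgt_dict
p_copy = np.array([0.0, 0.1, 0.0, 0.9])      # over the extended vocab
p_w = copy_distribution(p_softmax, p_copy, p_z=0.4)
```

Since both inputs are probability distributions and the switch probabilities sum to 1, the mixture is itself a valid distribution over the extended vocabulary.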

(Diagram: the input feeds both the softmax over the target vocabulary and the copy switch; the attention and src_map feed the copy distribution; the switch, softmax, and copy distributions combine into the output.)
Parameters:
  • input_size (int) – size of input representation
  • output_size (int) – size of output vocabulary
  • pad_idx (int) – index of the padding token
forward(hidden, attn, src_map)[source]

Compute a distribution over the target dictionary extended by the dynamic dictionary implied by copying source words.

Parameters:
  • hidden (FloatTensor) – hidden outputs [batch*tlen, input_size]
  • attn (FloatTensor) – attention over source tokens for each output [batch*tlen, src_len]
  • src_map (FloatTensor) – a sparse indicator matrix mapping each source word to its index in the “extended” vocabulary [src_len, batch, extra_words]

Structured Attention