
Token processing

OpenNMT provides generic tokenization utilities to quickly process new training data.

Tokenization

To tokenize a corpus:

th tools/tokenize.lua OPTIONS < file > file.tok
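
For example, to produce a tokenization that can later be detokenized (a sketch; -joiner_annotate and -case_feature are assumed here, see th tools/tokenize.lua -h for the full option list):

th tools/tokenize.lua -joiner_annotate -case_feature < file > file.tok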

Detokenization

If you activate the -joiner_annotate option, the tokenization is reversible. Just use:

th tools/detokenize.lua OPTIONS < file.tok > file.detok
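
For instance, tokenizing with -joiner_annotate and then detokenizing should give back the original text (an illustrative round trip):

th tools/tokenize.lua -joiner_annotate < file > file.tok
th tools/detokenize.lua < file.tok > file.detok
diff file file.detok   # should report no differences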

Special characters

  • ￨ is the feature separator symbol. If this character appears in the source text, it is replaced by its non-presentation form │.
  • ￭ is the default joiner marker (generated in -joiner_annotate mode). If this character appears in the source text, it is replaced by its non-presentation form ■.
  • ⦅...⦆ mark a sequence as protected: it won't be tokenized and its case feature is N.
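
As an illustration of the protected markers (a sketch with a made-up placeholder; output assumes the case feature is enabled, and the exact surface form may differ):

⦅WiFi-Zone⦆ area --> ⦅WiFi-Zone⦆│N area│L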

Mixed casing words

The -segment_case feature enables the tokenizer to segment words into subwords that each have one of 3 casing types (truecase ('House'), uppercase ('HOUSE') or lowercase ('house')), which helps restore the right casing during detokenization. This feature is especially useful for texts with a significant number of mixed-case words ('WiFi' -> 'Wi' and 'Fi'):

WiFi --> wi│C fi│C
TVs --> tv│U s│L
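
To reproduce this behaviour (a sketch; -segment_case is assumed to be combined with -case_feature and -joiner_annotate so that the segmentation remains reversible):

th tools/tokenize.lua -case_feature -segment_case -joiner_annotate < file > file.tok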

Alphabet Segmentation

Two options provide specific tokenization depending on alphabet:

  • -segment_alphabet_change: tokenize a sequence between two letters when their alphabets differ - for instance between a Latin alphabet character and a Han character.
  • -segment_alphabet Alphabet: tokenize all words of the indicated alphabet into characters - for instance to split a Chinese sentence into characters, use -segment_alphabet Han:
君子之心不胜其小,而气量涵盖一世。 --> 君 子 之 心 不 胜 其 小 , 而 气 量 涵 盖 一 世 。
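
For example (an illustrative command; both options can be combined, and -joiner_annotate keeps the result detokenizable):

th tools/tokenize.lua -segment_alphabet Han -segment_alphabet_change -joiner_annotate < file > file.tok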

BPE

OpenNMT's BPE module fully supports the original BPE as default mode:

tools/learn_bpe.lua -size 30000 -save_bpe codes < input
tools/tokenize.lua -bpe_model codes < input

with two additional features:

1. Add support for different modes of handling prefixes and/or suffixes: -bpe_mode (see the example after the list below)

  • suffix: BPE merge operations are learnt to distinguish sub-tokens like "ent" in the middle of a word and "ent</w>" at the end of a word. "</w>" is an artificial marker appended to the end of each token input and treated as a single unit before doing statistics on bigrams. This is the default mode, which is useful for most languages.
  • prefix: BPE merge operations are learnt to distinguish sub-tokens like "ent" in the middle of a word and "<w>ent" at the beginning of a word. "<w>" is an artificial marker prepended to the beginning of each token input and treated as a single unit before doing statistics on bigrams.
  • both: suffix + prefix
  • none: No artificial marker is appended to input tokens, a sub-token is treated equally whether it is in the middle or at the beginning or at the end of a token.
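
For example, to learn merges with both prefix and suffix markers instead of the default (a sketch; -bpe_mode is assumed to be an option of learn_bpe.lua, and codes_both is a made-up file name):

tools/learn_bpe.lua -size 30000 -bpe_mode both -save_bpe codes_both < input
tools/tokenize.lua -bpe_model codes_both < input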

2. Add support for combining BPE with the case feature: -bpe_case_insensitive

OpenNMT's tokenization flow first applies BPE and then adds the case feature for each input token. With standard BPE, "Constitution" and "constitution" may result in different sequences of sub-tokens:

Constitution --> con│C sti│l tu│l tion│l
constitution --> consti│l tu│l tion│l

If you want a caseless split, so that you get the full benefit of the case feature, you can achieve it with the following command lines:

# We don't need BPE to care about case
tools/learn_bpe.lua -size 30000 -save_bpe codes_lc < input_lowercased

# The case information is preserved in the true case input
tools/tokenize.lua -bpe_model codes_lc -bpe_case_insensitive < input

The output of the previous example would be:

Constitution --> con│C sti│l tu│l tion│l
constitution --> con│l sti│l tu│l tion│l