Sub-word Units

Much research in Natural Language Processing (NLP) and Machine Translation (MT) has shown the benefit of using sub-word units instead of word-level units.

Sub-word units we have seen so far:

  • Character

  • Morpheme

  • Short/Long unit word (especially for Japanese in some specific corpora)

  • Byte Pair Encoding (BPE) (Sennrich et al., 2016)

  • Stroke/Radical (especially for Chinese characters) (???, 2017/8)
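
Of the units listed above, BPE is the one learned from data rather than defined linguistically, so a small sketch may help. The Python below is a minimal illustration of the merge-learning idea only: start from characters and repeatedly merge the most frequent adjacent pair of symbols. The function names (get_pair_counts, merge_pair, learn_bpe), the toy vocabulary, and the "</w>" end-of-word marker are assumptions made for this example, not a reference implementation.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every occurrence of the given symbol pair into one symbol."""
    # The lookarounds make sure we match whole symbols, not parts of them.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_merges):
    """Greedily learn up to num_merges merge operations."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Ties are broken by first-seen order (an arbitrary choice here).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Hypothetical toy vocabulary: words pre-split into characters,
# with "</w>" marking the end of a word, mapped to corpus frequency.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}

merges, segmented = learn_bpe(toy_vocab, 3)
print(merges)
print(segmented)
```

On this toy vocabulary the first three merges build up the frequent suffix "est</w>", while rarer words such as "l o w e r </w>" stay split into characters. With more merges the segmentation moves gradually from character-level towards word-level units.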

Sub-word units work well in the following cases: