expanda.tokenization

Introduction

Traditionally, words – the basic units of language – were treated as indivisible in machine learning. Words were occasionally decomposed into characters, but most models used word-level approaches, and those approaches worked well. Even today, simple natural language tasks can be solved without trouble by word-level models.

In deep learning, however, the number of words in a corpus grows immensely. Compared to traditional machine learning, deep learning tackles more general problems, so content from many different categories is used, which eventually leads to a much larger vocabulary.

Why does the number of words actually matter? Before discussing that, we need to understand how to treat words as computable components. There are many ways to encode a word as a vector (one of the computable forms). The simplest and best-known method is One-Hot Encoding, which treats each word as a basis vector. For instance, the words are mapped as follows: \(\text{apple} = (1, 0, 0)\), \(\text{banana} = (0, 1, 0)\) and \(\text{orange} = (0, 0, 1)\). It is intuitive, easy to implement, and does not even require any pretraining. Each vector, however, carries no contextual semantics of the corresponding word. Moreover, because the vectors are mutually independent, each vector needs as many dimensions as there are words, so the total size of the encoding grows as \(O(n^2)\).
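
As a concrete illustration, here is a minimal sketch of the mapping above in plain Python with NumPy; the vocabulary and the helper function are made up for this example:

import numpy as np

# Toy vocabulary; each word becomes a standard basis vector of length len(vocab).
vocab = ['apple', 'banana', 'orange']
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    vector = np.zeros(len(vocab))
    vector[word_to_index[word]] = 1.0
    return vector

print(one_hot('banana'))  # [0. 1. 0.]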

So an alternative method has been proposed. Since the dimensionality of each vector was the problem, the method uses fixed-dimensional embedding vectors. The vectors are randomly initialized at first and then trained along with the other parameters. Fixing the dimensionality reduces the space complexity to \(O(n)\).

In fact, a fixed-dimensional embedding is theoretically the same as One-Hot Encoding followed by a feed-forward layer. In terms of linear algebra, it holds that

\[\newcommand{\M}{\mathcal{M}} \newcommand{\x}{\boldsymbol{x}} \newcommand{\e}{\boldsymbol{e}} \M(\x_i) = \hat{\M}(E \x_i) = \hat{\M}(\e_i)\]

where \(\M\) is a neural network whose first feed-forward layer is \(E = [ \e_1, \e_2, \cdots, \e_V ]\) with \(\e_i \in \mathbb{R}^D\), \(\hat{\M}\) denotes the remaining layers, and \(\x_i \in \{0, 1\}^V\) is a standard basis vector whose elements are all zero except the \(i\)th, which equals 1. \(V\) and \(D\) denote the vocabulary size and the dimensionality of the model respectively. We can interpret \(\e_i\) as the fixed-dimensional embedding vector of the \(i\)th word in the vocabulary.
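
The identity \(E \x_i = \e_i\) can be checked directly. The following sketch (NumPy, with an arbitrarily initialized layer used purely for illustration) shows that multiplying the feed-forward layer by a one-hot vector simply selects the corresponding embedding column:

import numpy as np

V, D = 3, 4                       # vocabulary size and model dimensionality
rng = np.random.default_rng(0)
E = rng.standard_normal((D, V))   # feed-forward layer; columns are e_1, ..., e_V

i = 1
x_i = np.zeros(V)
x_i[i] = 1.0                      # one-hot vector of the i-th word

# E @ x_i picks out the i-th column, i.e. the fixed-dimensional embedding e_i.
assert np.allclose(E @ x_i, E[:, i])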

Yet, this is still insufficient for extremely large corpora. A large corpus contains a huge number of words, and the embedding vectors still take up most of the memory. The remaining way to decrease memory usage is to decrease the vocabulary size. Similar to the case of embedding, we can consider fixing the number of words. If words can be split into subwords such as morphemes, so that the shared parts of different words are reused, then the vocabulary size can be reduced effectively by building the vocabulary out of those subwords, as the toy example below illustrates.
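
Here the segmentation is hand-crafted purely for illustration and is not produced by any particular algorithm; it only shows how sharing subword pieces shrinks the vocabulary:

# Seven surface forms, but only five distinct subword pieces.
words = ['play', 'playing', 'played', 'player', 'walk', 'walking', 'walked']
word_vocab = set(words)                        # word-level vocabulary: 7 entries

# Hand-crafted segmentation, for illustration only.
segmented = {
    'play': ['play'],           'playing': ['play', '##ing'],
    'played': ['play', '##ed'], 'player': ['play', '##er'],
    'walk': ['walk'],           'walking': ['walk', '##ing'],
    'walked': ['walk', '##ed'],
}
subword_vocab = {piece for pieces in segmented.values() for piece in pieces}

print(len(word_vocab))     # 7
print(len(subword_vocab))  # 5: play, walk, ##ing, ##ed, ##er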

Then, how can we tokenize words into subwords? Subword segmentation is inherently ambiguous and somewhat arbitrary, and there are many approaches to it. BPE (Byte-Pair Encoding) (1), Unigram LM (2) and the WordPiece Model (3) are representative ones. They are all widely used for subword tokenization and perform well. Although there are small differences in performance between them, these do not affect the overall performance of the model.

This module uses the WordPiece Model as implemented in HuggingFace Tokenizers. Thanks to tokenizers, this module provides functions for training a WordPiece model and for tokenizing a whole corpus into subwords. You can use those functions by importing this module, or simply try them on the command line. See Command-line Usage.

Functions

expanda.tokenization.train_tokenizer(input_file: str, vocab_file: str, temporary: str, subset_size: int = 512000000, vocab_size: int = 8000, limit_alphabet: int = 6000, unk_token: str = '<unk>', control_tokens: List[str] = [])

Train a WordPiece tokenizer and save the trained subword vocabulary. A usage sketch is given after the parameter list below.

Note

Since tokenizers reads the whole file into memory during training, this function could cause memory errors if input_file is too large. Under the assumption that input_file is randomly shuffled, only a subset of the input corpus is used in training.

Caution

The subset of the input corpus is saved in the temporary directory. Be careful not to delete that file while this function is executing.

Parameters
  • input_file (str) – Input file path.

  • vocab_file (str) – Output vocabulary file path.

  • temporary (str) – Temporary directory where the subset of the corpus will be saved.

  • subset_size (int) – The maximum number of lines in the subset.

  • vocab_size (int) – The number of subwords in the vocabulary.

  • limit_alphabet (int) – The maximum number of alphabet characters in the vocabulary.

  • unk_token (str) – Unknown token in the vocabulary.

  • control_tokens (list) – Control tokens in the vocabulary.
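
A minimal usage sketch of train_tokenizer is shown below; the file paths and the control tokens are placeholders for this example, not values required by the function:

from expanda.tokenization import train_tokenizer

# Hypothetical paths; any randomly shuffled plain-text corpus works as input_file.
train_tokenizer(
    input_file='corpus.txt',
    vocab_file='vocab.txt',
    temporary='tmp',
    vocab_size=8000,
    unk_token='<unk>',
    control_tokens=['<s>', '</s>', '<pad>'],   # placeholder control tokens
)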

expanda.tokenization.tokenize_corpus(input_file: str, output_file: str, vocab_file: str, unk_token: str = '<unk>', control_tokens: List[str] = [])

Tokenize corpus sentences with the trained WordPiece model. A usage sketch is given after the parameter list below.

Parameters
  • input_file (str) – Input corpus file path.

  • output_file (str) – Output file path.

  • vocab_file (str) – Trained vocabulary file path.

  • unk_token (str) – Unknown token in the vocabulary.

  • control_tokens (list) – Control tokens in the vocabulary.
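
A sketch of tokenizing a corpus with the vocabulary produced by train_tokenizer (paths are again placeholders):

from expanda.tokenization import tokenize_corpus

# Hypothetical paths; vocab.txt is the file written by train_tokenizer.
tokenize_corpus(
    input_file='corpus.txt',
    output_file='corpus.tokenized.txt',
    vocab_file='vocab.txt',
    unk_token='<unk>',
)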

Command-line Usage

usage: expanda-tokenization train [-h] [--tmp TMP] [--subset_size SUBSET_SIZE]
                                  [--vocab_size VOCAB_SIZE]
                                  [--unk_token UNK_TOKEN]
                                  [--control_tokens [CONTROL_TOKENS [CONTROL_TOKENS ...]]]
                                  input [vocab]

positional arguments:
  input
  vocab                 output vocabulary file

optional arguments:
  -h, --help            show this help message and exit
  --tmp TMP             temporary directory path
  --subset_size SUBSET_SIZE
                        maximum number of lines in subset
  --vocab_size VOCAB_SIZE
                        number of subwords in vocabulary
  --unk_token UNK_TOKEN
                        unknown token name
  --control_tokens [CONTROL_TOKENS [CONTROL_TOKENS ...]]
                        control token names except unknown token
usage: expanda.tokenization tokenize [-h] [--unk_token UNK_TOKEN]
                                     [--control_tokens [CONTROL_TOKENS [CONTROL_TOKENS ...]]]
                                     input output vocab

positional arguments:
  input
  output
  vocab

optional arguments:
  -h, --help            show this help message and exit
  --unk_token UNK_TOKEN
                        unknown token name
  --control_tokens [CONTROL_TOKENS [CONTROL_TOKENS ...]]
                        control token names except unknown token
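
Putting the two subcommands together, a typical invocation might look like the following; the file names are placeholders, and the entry-point names are taken verbatim from the usage strings above, so they may differ depending on how the package is installed:

$ expanda-tokenization train corpus.txt vocab.txt --tmp tmp --vocab_size 8000
$ expanda.tokenization tokenize corpus.txt corpus.tokenized.txt vocab.txt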

References

(1) P. Gage. 1994. “A New Algorithm for Data Compression”. C Users Journal, February 1994.

(2) T. Kudo. 2018. “Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates”.

(3) Y. Wu et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation”.