Expanda documentation¶
Introduction¶
Expanda is an integrated corpus-building environment. Expanda provides integrated pipelines for building a corpus dataset. Building corpus dataset requires several complicated pipelines such as parsing, shuffling, and tokenization. If the corpora are gathered from different applications, it would be a problem to parse various formats. Expanda helps to build corpus simply at once by setting build configuration.
Dependencies¶
nltk
ijson
tqdm>=4.46.0
mwparserfromhell>=0.5.4
tokenizers>=0.7.0
kss==1.3.1
Installation¶
From source¶
You can install from source by cloning the repository and running:
$ git clone https://github.com/affjljoo3581/Expanda.git
$ cd Expanda
$ python setup.py install
Command-line Usage¶
You can use the features of Expanda in command-line.
Build Dataset¶
expanda build
command builds corpus dataset through the given build
configuration file. The detail is as follows:
usage: expanda build [-h] [config]
positional arguments:
config expanda configuration file
optional arguments:
-h, --help show this help message and exit
Show Extension Detail¶
After installing extensions, you can check if the extensions are recognizable.
Expanda loads extensions by importing corresponding modules. If certain
extensions are installed in different virtual environment, Expanda cannot
use the extensions. So before building the dataset, check whether the
extensions are accessible or not. expanda show
command shows the details of
the given extension. The detail is as follows:
usage: expanda show [-h] extension
positional arguments:
extension module name of certain extension
optional arguments:
-h, --help show this help message and exit
List of Required Extensions¶
expanda list
command shows a list of extensions defined in the given
configuration. Namely, you can see which extensions are used in this dataset.
The detail is as follows:
usage: expanda list [-h] [config]
positional arguments:
config expanda configuration file
optional arguments:
-h, --help show this help message and exit
Build Configuration¶
Before building a corpus, you need to setup a build configuration in the workspace first. The configuration file follows INI format. Here is an example:
# extension configurations
# ...
[tokenization]
subset-size = 1000000000
vocab-size = 32000
limit-alphabet = 6000
unk-token = <unk>
control-tokens = <s>
</s>
<pad>
[build]
input-files =
--my.extension.foo1 src/bar1.xml
--my.extension.foo2 src/bar2.txt
--my.extension.foo3 src/bar3.xml.bz2
balancing = true
split-ratio = 0.1
temporary-path = tmp
output-vocab = build/vocab.txt
output-train-corpus = build/corpus.train.txt
output-test-corpus = build/corpus.test.txt
output-raw-corpus = build/corpus.raw.txt
Basically, you need to configure two sections – tokenization and
build. tokenization section contains arguments for tokenizing texts,
described in expanda.tokenization. You can declare symbol names and define
tokenization options. In build section, you can set input, output files and
temporary directory. balancing
determines whether to modify the amount of
each corpus uniformly. Note that others like unk-token
and
control-tokens
should be given.
If there is any pretrained vocabulary file for corpora, you can skip training
tokenizer model by setting input-vocab
in build section to the
vocabulary file path. In this case, subset-size
, vocab-size
and
limit-alphabet
arguments in tokenization section would be ignored and,
therefore, you don’t need to adjust the arguments in detail.
Perhaps extensions used for constructing the dataset would need their own options in extracting. You can configure the options in each section with the corresponding module name. See also expanda.extension.
After building the dataset, the workspace should be as below:
workspace
├── build
│ ├── corpus.raw.txt
│ ├── corpus.train.txt
│ ├── corpus.test.txt
│ └── vocab.txt
├── src
│ ├── bar1.xml
│ ├── bar2.txt
│ └── bar3.xml.bz2
└── expanda.cfg
Modules¶
Expanda contains pipeline modules for building corpus dataset. Some useful modules can be used in command-line independently.
Extensions¶
Expanda provides basic extensions to help parsing corpus files. You can use the extensions without any additional installations.