expanda.extension
Introduction
Corpora crawled or downloaded from the Internet have diverse structures and formats because each source serves a different purpose. For instance, Wikipedia is a collection of articles, whereas Twitter consists of tweets. Structure also varies with the platform that hosts the content: the same wiki content can be structured differently depending on the wiki engine.
The purpose of expanda is to integrate the procedures and pipelines for building corpus datasets. To consolidate the process across various corpora, their formats must first be regularized into a common form.
For that reason, this module provides an interface for constructing regularization procedures for the corresponding corpus formats. The regularization procedures are treated as extensions, and anyone can write code to parse their own corpus.
Every extension should define an __extension__ variable in the global scope. The basic form of the variable is as follows:
__extension__ = {
    'name': 'The name of extension',
    'version': '1.0',
    'description': 'This is a simple example of extension',
    'author': 'Who wrote this code?',
    'main': <extension implementation>,
    'arguments': {
        'param name': {'type': <param type>, 'default': <default>},
        ...
    }
}
Expanda uses an extension by importing the corresponding module, so the __extension__ variable must be in the global scope of that module. All parameters except main are optional and may be omitted. The main parameter, which is the implementation of the extension, must be defined in the __extension__ variable. The implementation function receives four arguments:
from typing import Any, Dict

def main_code(input_file: str, output_file: str, temporary: str, args: Dict[str, Any]):
    ...  # Implementation of the extension goes here.
input_file and output_file are the raw-format input corpus file and the parsed output file, respectively. temporary is a temporary directory assigned for use while the extension executes. args is a dictionary built from the expanda configuration. Note that the arguments are cast to the types defined in __extension__.
The role of the implementation is simple: read the given input_file and extract plain text from it. After splitting the text into single sentences, save the sentences to output_file. Expanda collects the pipelines from the configuration and executes the extensions automatically. The extracted texts are then combined, and other procedures are applied to the corpora.
Expanda provides some useful extensions in the expanda.ext package. See also Extensions.
Note
While extracting a corpus, temporary files might be needed. Expanda recommends using Utils for Extension when creating temporary files. All extensions that Expanda provides create files with random_filename and random_filenames.
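The signatures of Expanda's own helpers are documented in Utils for Extension and are not reproduced here. Purely to illustrate the idea of placing scratch files inside the assigned temporary directory, the following sketch uses only the standard library. The function make_temp_path is a hypothetical stand-in, not Expanda's API.

```python
import os
import uuid


def make_temp_path(temporary: str) -> str:
    # Hypothetical stand-in, not Expanda's helper: return an unused,
    # randomly named path inside the given temporary directory.
    while True:
        path = os.path.join(temporary, uuid.uuid4().hex)
        if not os.path.exists(path):
            return path
```

Inside an extension, such a path could hold intermediate output (for example, decompressed data) before the final sentences are written to output_file.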
Classes
class expanda.extension.Extension(module_name: str)
Wrapper class for an extension.
Every extension should define its information in the __extension__ variable. This wrapper class provides a simple interface for using extensions: it summarizes their attributes and helps execute them.
Caution
This class dynamically imports the given module_name. Make sure that the extension module can be imported in the current environment.
- Parameters
module_name (str) – Module name of extension.
call(input_file: str, output_file: str, temporary: str, raw_args: Dict[str, str])
Call the main code of the extension.
Note
Extensions declare their own parameter requirements in the __extension__ variable. This function automatically casts each string-formatted raw argument to its declared type.
- Parameters
input_file (str) – Input file path.
output_file (str) – Output file path.
temporary (str) – Temporary directory that the extension may use.
raw_args (dict) – String-formatted raw arguments for extension.
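To make the behavior described for this class more concrete, the following is a rough sketch of how a wrapper could import a module by name, read its __extension__ variable, cast string-formatted arguments, and invoke main. It is not Expanda's actual implementation. The helper names load_extension, cast_args, and call_extension are hypothetical, and falling back to the declared default when an argument is missing is an assumption.

```python
import importlib
from typing import Any, Dict


def load_extension(module_name: str) -> Dict[str, Any]:
    # Import the module by name and read its global `__extension__` dict.
    module = importlib.import_module(module_name)
    return getattr(module, '__extension__')


def cast_args(spec: Dict[str, Dict[str, Any]],
              raw_args: Dict[str, str]) -> Dict[str, Any]:
    # Cast each provided string with the declared type; otherwise fall back
    # to the declared default (an assumption made for this sketch).
    args = {}
    for name, opts in spec.items():
        if name in raw_args:
            args[name] = opts['type'](raw_args[name])
        else:
            args[name] = opts.get('default')
    return args


def call_extension(module_name: str, input_file: str, output_file: str,
                   temporary: str, raw_args: Dict[str, str]):
    # Resolve the extension, build typed arguments, then run its main code.
    ext = load_extension(module_name)
    args = cast_args(ext.get('arguments', {}), raw_args)
    return ext['main'](input_file, output_file, temporary, args)
```

One caveat of casting by calling the type directly: bool('false') evaluates to True in Python, so boolean parameters would need explicit string parsing in any scheme like this one.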