expanda.extension

Introduction

Corpora crawled or downloaded from the Internet come in diverse structures and formats, because each source serves a different purpose. For instance, Wikipedia is a collection of articles, while Twitter consists of tweets. Structure also varies with the platform that hosts the content: the same wiki content can be structured differently depending on the wiki engine.

The purpose of Expanda is to integrate the procedures and pipelines for building a corpus dataset. To consolidate the process across various corpora, their formats must be regularized into a common form.

For that reason, this module provides an interface for constructing regularization procedures for the corresponding corpus formats. These procedures are treated as extensions, and anyone can write code to parse their own corpus.

Every extension should define an __extension__ variable at module level. The basic form of the variable is as follows:

__extension__ = {
    'name': 'The name of extension',
    'version': '1.0',
    'description': 'This is a simple example of extension',
    'author': 'Who wrote this code?',
    'main': <extension implementation>,
    'arguments': {
        'param name': {'type': <param type>, 'default': <default>},
        ...
    }
}

Expanda uses an extension by importing the corresponding module, so the __extension__ variable must be in the global scope of the module. All keys except main are optional and can be omitted. The main key, which holds the implementation of the extension, must be defined in the __extension__ variable. The implementation function receives four arguments:

from typing import Any, Dict

def main_code(input_file: str, output_file: str, temporary: str, args: Dict[str, Any]):
    ...  # Implementation of the extension

input_file and output_file are the paths of the raw-format input corpus file and the parsed output file, respectively. temporary is a temporary directory assigned to the extension for use during execution. args is a dictionary of arguments taken from the Expanda configuration. Note that the arguments are cast to the types defined in __extension__.

The role of the implementation is simple: read the given input_file, extract its content as plain text, split the text into single sentences, and save the sentences to output_file. Expanda collects the pipelines from the configuration and executes the extensions automatically. The extracted texts are then combined, and the remaining procedures are applied to the combined corpus.

Expanda provides some useful extensions in the expanda.ext package. See also Extensions.

Note

Temporary files might be needed while extracting a corpus. Expanda recommends using Utils for Extension when creating temporary files. All extensions that Expanda provides create their files with random_filename and random_filenames.
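As a rough illustration of working inside the assigned temporary directory, the sketch below stages an intermediate file under a random name and removes it afterwards. make_temp_path is a hypothetical stand-in written with the standard library; it is not Expanda's random_filename, whose exact signature is not shown in this section.

```python
import os
import uuid


def make_temp_path(temporary: str) -> str:
    # Hypothetical stand-in, not Expanda's own utility: build a path with a
    # random name inside the temporary directory given to the extension.
    return os.path.join(temporary, uuid.uuid4().hex)


def drop_blank_lines(input_file: str, output_file: str, temporary: str):
    tmp_path = make_temp_path(temporary)
    try:
        # Stage 1: write the intermediate result to a temporary file.
        with open(input_file, 'r', encoding='utf-8') as src, \
                open(tmp_path, 'w', encoding='utf-8') as tmp:
            for line in src:
                if line.strip():
                    tmp.write(line)

        # Stage 2: produce the final output from the intermediate file.
        with open(tmp_path, 'r', encoding='utf-8') as tmp, \
                open(output_file, 'w', encoding='utf-8') as dst:
            for line in tmp:
                dst.write(line.strip() + '\n')
    finally:
        # Remove the intermediate file so the directory is left clean.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```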

Classes

class expanda.extension.Extension(module_name: str)

Wrapper class of extension.

Every extension should define its information in the __extension__ variable. This wrapper class provides a simple interface for using extensions: it collects their attributes and helps execute them.

Caution

This class dynamically imports the given module_name. Make sure that the extension module can be imported in the current environment.
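The dynamic import mentioned in this caution can be pictured with the standard importlib module. The sketch below is not the Extension class itself; load_extension_info and the fallback values for the optional keys are assumptions made for illustration.

```python
import importlib
from typing import Any, Dict


def load_extension_info(module_name: str) -> Dict[str, Any]:
    # Raises ImportError if the module cannot be imported in the
    # current environment.
    module = importlib.import_module(module_name)

    info = getattr(module, '__extension__', None)
    if info is None:
        raise AttributeError(
            '{} does not define __extension__'.format(module_name))
    if 'main' not in info:
        raise KeyError('main is required in __extension__')

    # Every key except 'main' is optional. The fallback values here are
    # assumptions for this sketch only.
    return {
        'name': info.get('name', module_name),
        'version': info.get('version', ''),
        'description': info.get('description', ''),
        'author': info.get('author', ''),
        'main': info['main'],
        'arguments': info.get('arguments', {}),
    }
```

Checking for __extension__ and main right after the import surfaces a misconfigured module early, before any corpus file is processed.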

Parameters

module_name (str) – Module name of extension.

call(input_file: str, output_file: str, temporary: str, raw_args: Dict[str, str])

Call the main code of the extension.

Note

Extensions declare their own parameter requirements in the __extension__ variable. This function automatically casts each string-formatted raw argument to its declared type.

Parameters
  • input_file (str) – Input file path.

  • output_file (str) – Output file path.

  • temporary (str) – Temporary directory that the extension may use.

  • raw_args (dict) – String-formatted raw arguments for extension.
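The casting of string-formatted raw arguments can be sketched as follows. This is a minimal illustration of the idea, assuming the 'arguments' layout shown earlier; cast_arguments is a hypothetical helper and not necessarily how call is implemented.

```python
from typing import Any, Dict


def cast_arguments(spec: Dict[str, Dict[str, Any]],
                   raw_args: Dict[str, str]) -> Dict[str, Any]:
    casted = {}
    for name, option in spec.items():
        if name in raw_args:
            # Apply the declared type to the raw string, e.g. int('16').
            casted[name] = option['type'](raw_args[name])
        else:
            # Fall back to the declared default when no value is given.
            casted[name] = option.get('default')
    return casted


spec = {
    'min_len': {'type': int, 'default': 1},
    'ratio': {'type': float, 'default': 0.5},
    'lang': {'type': str, 'default': 'en'},
}
args = cast_arguments(spec, {'min_len': '16', 'ratio': '0.25'})
```

Note that a plain constructor call is not suitable for every type: in Python, bool('False') evaluates to True, so boolean options would need dedicated parsing in a sketch like this.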