DCR-CORE - Application - Configuration


1. `logging_cfg.yaml`

This file controls the logging behaviour of the application.

Default content:

version: 1

formatters:
  simple:
    format: "%(asctime)s [%(module)s.py  ] %(levelname)-5s %(funcName)s:%(lineno)d %(message)s"
  extended:
    format: "%(asctime)s [%(module)s.py  ] %(levelname)-5s %(funcName)s:%(lineno)d \n%(message)s"

handlers:
  console:
    class: logging.StreamHandler
    level: INFO
    formatter: simple

  file_handler:
    class: logging.FileHandler
    level: INFO
    filename: logging_dcr_core.log
    formatter: extended

loggers:
  dcr_core:
    handlers: [ console ]
root:
  handlers: [ file_handler ]
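
To make the structure of this file easier to follow, the sketch below rebuilds the same settings as a Python dictionary and passes them to the standard `logging.config.dictConfig`. The function names `build_logging_config` and `configure_logging` are hypothetical helpers introduced for this example; how DCR-CORE itself loads the YAML file is not described here.

```python
import logging
import logging.config


def build_logging_config(log_file="logging_dcr_core.log"):
    """Return a dict equivalent to the default logging_cfg.yaml shown above."""
    # Common part of both formatter strings.
    prefix = "%(asctime)s [%(module)s.py  ] %(levelname)-5s %(funcName)s:%(lineno)d "
    return {
        "version": 1,
        "formatters": {
            "simple": {"format": prefix + "%(message)s"},
            # The extended format puts the message on its own line.
            "extended": {"format": prefix + "\n%(message)s"},
        },
        "handlers": {
            "console": {
                "class": "logging.StreamHandler",
                "level": "INFO",
                "formatter": "simple",
            },
            "file_handler": {
                "class": "logging.FileHandler",
                "level": "INFO",
                "filename": log_file,
                "formatter": "extended",
            },
        },
        "loggers": {"dcr_core": {"handlers": ["console"]}},
        "root": {"handlers": ["file_handler"]},
    }


def configure_logging(log_file="logging_dcr_core.log"):
    """Apply the configuration and return the dcr_core logger (hypothetical helper)."""
    logging.config.dictConfig(build_logging_config(log_file))
    return logging.getLogger("dcr_core")
```

With PyYAML installed, `yaml.safe_load` applied to the file content should produce an equivalent dictionary. Note that neither `dcr_core` nor `root` sets a `level` in this configuration, so under Python's defaults records below `WARNING` are filtered at the logger before they reach the `INFO` handlers, unless a level is set elsewhere.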

2. `setup.cfg`

This file controls the behaviour of the DCR-CORE application.

The customisable entries are:

[dcr_core]
create_extra_file_heading = true
create_extra_file_list_bullet = true
create_extra_file_list_number = true
create_extra_file_table = true
delete_auxiliary_files = true
directory_inbox = data/inbox_prod
json_indent = 4
json_sort_keys = false
lt_export_rule_file_heading = data/lt_export_rule_heading.json
lt_export_rule_file_list_bullet = data/lt_export_rule_list_bullet.json
lt_export_rule_file_list_number = data/lt_export_rule_list_number.json
lt_footer_max_distance = 3
lt_footer_max_lines = 3
lt_header_max_distance = 3
lt_header_max_lines = 3
lt_heading_file_incl_no_ctx = 1
lt_heading_file_incl_regexp = false
lt_heading_max_level = 3
lt_heading_min_pages = 2
lt_heading_rule_file = none
lt_heading_tolerance_llx = 10
lt_list_bullet_min_entries = 2
lt_list_bullet_rule_file = none
lt_list_bullet_tolerance_llx = 10
lt_list_number_file_incl_regexp = false
lt_list_number_min_entries = 2
lt_list_number_rule_file = none
lt_list_number_tolerance_llx = 10
lt_table_file_incl_empty_columns = true
lt_toc_last_page = 5
lt_toc_min_entries = 5
pdf2image_type = jpeg
tesseract_timeout = 30
tetml_page = false
tetml_word = false
tokenize_2_database = true
tokenize_2_jsonfile = true
verbose = true
verbose_lt_header_footer = false
verbose_lt_heading = false
verbose_lt_list_bullet = false
verbose_lt_list_number = false
verbose_lt_table = false
verbose_lt_toc = false
verbose_parser = none
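
As an illustration of how such entries can be read with typed values, here is a minimal sketch using Python's standard `configparser`. The helper `read_core_settings`, the trimmed `SAMPLE` text, and the treatment of the literal `none` as "not configured" are assumptions made for this example and are not taken from DCR-CORE's own loader.

```python
import configparser
import json

# A shortened copy of the [dcr_core] section, for illustration only.
SAMPLE = """
[dcr_core]
delete_auxiliary_files = true
directory_inbox = data/inbox_prod
json_indent = 4
json_sort_keys = false
lt_heading_rule_file = none
tesseract_timeout = 30
"""


def read_core_settings(text):
    """Parse the [dcr_core] section and convert values to Python types."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["dcr_core"]
    rule_file = section.get("lt_heading_rule_file")
    return {
        "delete_auxiliary_files": section.getboolean("delete_auxiliary_files"),
        "directory_inbox": section.get("directory_inbox"),
        "json_indent": section.getint("json_indent"),
        "json_sort_keys": section.getboolean("json_sort_keys"),
        # Assumption: the literal "none" means that no rule file is configured.
        "lt_heading_rule_file": None if rule_file.lower() == "none" else rule_file,
        "tesseract_timeout": section.getint("tesseract_timeout"),
    }


settings = read_core_settings(SAMPLE)

# json_indent and json_sort_keys map naturally onto json.dumps arguments.
print(
    json.dumps(
        {"b": 1, "a": 2},
        indent=settings["json_indent"],
        sort_keys=settings["json_sort_keys"],
    )
)
```

With `json_sort_keys = false`, the keys in the printed output keep the insertion order of the Python dictionary.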

Parameter	Description
create_extra_file_heading	Create a separate `JSON` file with the table of contents.
create_extra_file_list_bullet	Create a separate `JSON` file with the bulleted lists.
create_extra_file_list_number	Create a separate `JSON` file with the numbered lists.
create_extra_file_table	Create a separate `JSON` file with the tables.
delete_auxiliary_files	Delete the auxiliary files after a successful processing step.
directory_inbox	Directory for the new documents received.
json_indent	Number of spaces used to indent the `JSON` file, which improves its readability.
json_sort_keys	If it is set to `true`, the keys are sorted in ascending order; otherwise, they appear in the order of the Python object.
lt_export_rule_file_heading	File name for the export of the heading rules.
lt_export_rule_file_list_bullet	File name for the export of the bulleted list rules.
lt_export_rule_file_list_number	File name for the export of the numbered list rules.
lt_footer_max_distance	Maximum Levenshtein distance for a footer line.
lt_footer_max_lines	Maximum number of footer lines.
lt_header_max_distance	Maximum Levenshtein distance for a header line.
lt_header_max_lines	Maximum number of header lines.
lt_heading_file_incl_no_ctx	The number of lines following the heading to be included as context into the `JSON` file.
lt_heading_file_incl_regexp	If it is set to `true`, the regular expression for the heading is included in the `JSON` file.
lt_heading_max_level	Maximum level of the heading structure.
lt_heading_min_pages	Minimum number of pages to determine the headings.
lt_heading_rule_file	File with rules to determine the headings.
lt_heading_tolerance_llx	Tolerance of vertical indentation in percent.
lt_list_bullet_min_entries	Minimum number of entries to determine a bulleted list.
lt_list_bullet_rule_file	File with rules to determine the bulleted lists.
lt_list_bullet_tolerance_llx	Tolerance of vertical indentation in percent.
lt_list_number_file_incl_regexp	If it is set to `true`, the regular expression for the numbered list is included in the `JSON` file.
lt_list_number_min_entries	Minimum number of entries to determine a numbered list.
lt_list_number_rule_file	File with rules to determine the numbered lists.
lt_list_number_tolerance_llx	Tolerance of vertical indentation in percent.
lt_table_file_incl_empty_columns	If it is set to `true`, the empty cells are included in the separate `JSON` file with the tables.
lt_toc_last_page	Maximum number of pages for the search of the TOC (from the beginning).
lt_toc_min_entries	Minimum number of TOC entries.
pdf2image_type	Format of the image files for the scanned `pdf` document: `jpeg` or `pdf`.
tesseract_timeout	Terminate the tesseract job after a period of time (seconds).
tetml_page	PDFlib TET granularity 'page'.
tetml_word	PDFlib TET granularity 'word'.
tokenize_2_database	Store the tokens in the database table `token`.
tokenize_2_jsonfile	Store the tokens in a `JSON` flat file.
verbose	Display progress messages for processing.
verbose_lt_header_footer	Display progress messages for headers & footers line type determination.
verbose_lt_heading	Display progress messages for heading line type determination.
verbose_lt_list_bullet	Display progress messages for line type determination of a bulleted list.
verbose_lt_list_number	Display progress messages for line type determination of a numbered list.
verbose_lt_table	Display progress messages for table line type determination.
verbose_lt_toc	Display progress messages for table of contents line type determination.
verbose_parser	Display progress messages for parsing `xml` (TETML): `all`, `none` or `text`.
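
The parameters `lt_header_max_distance` and `lt_footer_max_distance` refer to the Levenshtein distance, that is, the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below shows the standard dynamic-programming computation and one plausible way such a threshold could be applied to lines from neighbouring pages. The function `looks_like_repeated_line` is a hypothetical illustration, not DCR-CORE's actual header and footer detection.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def looks_like_repeated_line(line_a: str, line_b: str, max_distance: int = 3) -> bool:
    """Hypothetical check: treat two lines as the same header or footer
    if they differ by at most max_distance edits."""
    return levenshtein(line_a, line_b) <= max_distance
```

Under this reading, footers such as `Page 1 of 20` and `Page 2 of 20` differ by a single substitution and would fall within the default maximum distance of 3.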

The configuration parameters can be set differently for the individual environments (dev, prod and test).

Examples:

[dcr_core.env.dev]
delete_auxiliary_files = false
directory_inbox = data/inbox_dev
lt_footer_max_lines = 3
lt_header_max_lines = 3
lt_heading_file_incl_no_ctx = 3
lt_heading_file_incl_regexp = true
lt_heading_tolerance_llx = 5
lt_list_bullet_tolerance_llx = 5
lt_list_number_file_incl_regexp = true
lt_list_number_tolerance_llx = 5
lt_table_file_incl_empty_columns = false
tetml_page = true
tetml_word = true
...
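
One way to picture the effect of an environment-specific section is as an overlay on the base `[dcr_core]` section. The following minimal sketch uses Python's standard `configparser`. The helper `effective_settings`, its `environment` argument and the shortened `SAMPLE` text are assumptions made for this example; how DCR-CORE actually selects the active environment is not covered here.

```python
import configparser

# Shortened base section plus a dev override, for illustration only.
SAMPLE = """
[dcr_core]
delete_auxiliary_files = true
directory_inbox = data/inbox_prod
lt_heading_tolerance_llx = 10

[dcr_core.env.dev]
delete_auxiliary_files = false
directory_inbox = data/inbox_dev
lt_heading_tolerance_llx = 5
"""


def effective_settings(text, environment):
    """Return the base settings overlaid with the given environment's section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    settings = dict(parser["dcr_core"])
    env_section = f"dcr_core.env.{environment}"
    # Values from the environment section replace the base values;
    # a missing environment section leaves the base settings unchanged.
    if parser.has_section(env_section):
        settings.update(parser[env_section])
    return settings


print(effective_settings(SAMPLE, "dev")["directory_inbox"])
print(effective_settings(SAMPLE, "prod")["directory_inbox"])
```

In this sketch all values remain strings; type conversion would be a separate step.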

4. `setup.cfg` - spaCy Token Attributes

The tokens derived from the documents can be qualified via various attributes. The available options are described below.

[dcr_core.spacy]
spacy_ignore_bracket = false
spacy_ignore_left_punct = false
spacy_ignore_line_type_footer = false
spacy_ignore_line_type_header = false
spacy_ignore_line_type_heading = false
spacy_ignore_line_type_list_bullet = false
spacy_ignore_line_type_list_number = false
spacy_ignore_line_type_table = false
spacy_ignore_line_type_toc = false
spacy_ignore_punct = false
spacy_ignore_quote = false
spacy_ignore_right_punct = false
spacy_ignore_space = false
spacy_ignore_stop = false
spacy_tkn_attr_cluster = true
spacy_tkn_attr_dep_ = true
spacy_tkn_attr_doc = true
spacy_tkn_attr_ent_iob_ = true
spacy_tkn_attr_ent_kb_id_ = true
spacy_tkn_attr_ent_type_ = true
spacy_tkn_attr_head = true
spacy_tkn_attr_i = true
spacy_tkn_attr_idx = true
spacy_tkn_attr_is_alpha = true
spacy_tkn_attr_is_ascii = true
spacy_tkn_attr_is_bracket = true
spacy_tkn_attr_is_currency = true
spacy_tkn_attr_is_digit = true
spacy_tkn_attr_is_left_punct = true
spacy_tkn_attr_is_lower = true
spacy_tkn_attr_is_oov = true
spacy_tkn_attr_is_punct = true
spacy_tkn_attr_is_quote = true
spacy_tkn_attr_is_right_punct = true
spacy_tkn_attr_is_sent_end = true
spacy_tkn_attr_is_sent_start = true
spacy_tkn_attr_is_space = true
spacy_tkn_attr_is_stop = true
spacy_tkn_attr_is_title = true
spacy_tkn_attr_is_upper = true
spacy_tkn_attr_lang_ = true
spacy_tkn_attr_left_edge = true
spacy_tkn_attr_lemma_ = true
spacy_tkn_attr_lex = true
spacy_tkn_attr_lex_id = true
spacy_tkn_attr_like_email = true
spacy_tkn_attr_like_num = true
spacy_tkn_attr_like_url = true
spacy_tkn_attr_lower_ = true
spacy_tkn_attr_morph = true
spacy_tkn_attr_norm_ = true
spacy_tkn_attr_orth_ = true
spacy_tkn_attr_pos_ = true
spacy_tkn_attr_prefix_ = true
spacy_tkn_attr_prob = true
spacy_tkn_attr_rank = true
spacy_tkn_attr_right_edge = true
spacy_tkn_attr_sent = true
spacy_tkn_attr_sentiment = true
spacy_tkn_attr_shape_ = true
spacy_tkn_attr_suffix_ = true
spacy_tkn_attr_tag_ = true
spacy_tkn_attr_tensor = true
spacy_tkn_attr_text = true
spacy_tkn_attr_text_with_ws = true
spacy_tkn_attr_vocab = true
spacy_tkn_attr_whitespace_ = true

Parameter	Description
spacy_ignore_bracket	Ignore the tokens which are brackets?
spacy_ignore_left_punct	Ignore the tokens which are left punctuation marks, e.g. "("?
spacy_ignore_line_type_footer	Ignore the tokens from line type footer?
spacy_ignore_line_type_header	Ignore the tokens from line type header?
spacy_ignore_line_type_heading	Ignore the tokens from line type heading?
spacy_ignore_line_type_list_bullet	Ignore the tokens from line type bulleted list?
spacy_ignore_line_type_list_number	Ignore the tokens from line type numbered list?
spacy_ignore_line_type_table	Ignore the tokens from line type table?
spacy_ignore_line_type_toc	Ignore the tokens from line type TOC?
spacy_ignore_punct	Ignore the tokens which are punctuation marks?
spacy_ignore_quote	Ignore the tokens which are quotation marks?
spacy_ignore_right_punct	Ignore the tokens which are right punctuation marks, e.g. ")"?
spacy_ignore_space	Ignore the tokens which consist of whitespace characters?
spacy_ignore_stop	Ignore the tokens which are part of a “stop list”?

spacy_tkn_attr_cluster	Brown cluster ID.
spacy_tkn_attr_dep_	Syntactic dependency relation.
spacy_tkn_attr_doc	The parent document.
spacy_tkn_attr_ent_iob_	IOB code of named entity tag.
spacy_tkn_attr_ent_kb_id_	Knowledge base ID that refers to the named entity this token is a part of, if any.
spacy_tkn_attr_ent_type_	Named entity type.
spacy_tkn_attr_head	The syntactic parent, or “governor”, of this token.
spacy_tkn_attr_i	The index of the token within the parent document.
spacy_tkn_attr_idx	The character offset of the token within the parent document.
spacy_tkn_attr_is_alpha	Does the token consist of alphabetic characters?
spacy_tkn_attr_is_ascii	Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`.
spacy_tkn_attr_is_bracket	Is the token a bracket?
spacy_tkn_attr_is_currency	Is the token a currency symbol?
spacy_tkn_attr_is_digit	Does the token consist of digits?
spacy_tkn_attr_is_left_punct	Is the token a left punctuation mark, e.g. "("?
spacy_tkn_attr_is_lower	Is the token in lowercase? Equivalent to `token.text.islower()`.
spacy_tkn_attr_is_oov	Is the token out-of-vocabulary?
spacy_tkn_attr_is_punct	Is the token punctuation?
spacy_tkn_attr_is_quote	Is the token a quotation mark?
spacy_tkn_attr_is_right_punct	Is the token a right punctuation mark, e.g. ")"?
spacy_tkn_attr_is_sent_end	Does the token end a sentence?
spacy_tkn_attr_is_sent_start	Does the token start a sentence?
spacy_tkn_attr_is_space	Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`.
spacy_tkn_attr_is_stop	Is the token part of a “stop list”?
spacy_tkn_attr_is_title	Is the token in titlecase?
spacy_tkn_attr_is_upper	Is the token in uppercase? Equivalent to `token.text.isupper()`.
spacy_tkn_attr_lang_	Language of the parent document’s vocabulary.
spacy_tkn_attr_left_edge	The leftmost token of this token’s syntactic descendants.
spacy_tkn_attr_lemma_	Base form of the token, with no inflectional suffixes.
spacy_tkn_attr_lex	The underlying lexeme.
spacy_tkn_attr_lex_id	Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors.
spacy_tkn_attr_like_email	Does the token resemble an email address?
spacy_tkn_attr_like_num	Does the token represent a number?
spacy_tkn_attr_like_url	Does the token resemble a URL?
spacy_tkn_attr_lower_	Lowercase form of the token text. Equivalent to `Token.text.lower()`.
spacy_tkn_attr_morph	Morphological analysis.
spacy_tkn_attr_norm_	The token’s norm, i.e. a normalized form of the token text.
spacy_tkn_attr_orth_	Verbatim text content (identical to Token.text). Exists mostly for consistency with the other attributes.
spacy_tkn_attr_pos_	Coarse-grained part-of-speech from the Universal POS tag set.
spacy_tkn_attr_prefix_	A length-N substring from the start of the token. Defaults to N=1.
spacy_tkn_attr_prob	Smoothed log probability estimate of token’s word type (context-independent entry in the vocabulary).
spacy_tkn_attr_rank	Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors.
spacy_tkn_attr_right_edge	The rightmost token of this token’s syntactic descendants.
spacy_tkn_attr_sent	The sentence span that this token is a part of.
spacy_tkn_attr_sentiment	A scalar value indicating the positivity or negativity of the token.
spacy_tkn_attr_shape_	Transform of the token’s string to show orthographic features.
spacy_tkn_attr_suffix_	Length-N substring from the end of the token. Defaults to N=3.
spacy_tkn_attr_tag_	Fine-grained part-of-speech.
spacy_tkn_attr_tensor	The token’s slice of the parent Doc’s tensor.
spacy_tkn_attr_text	Verbatim text content.
spacy_tkn_attr_text_with_ws	Text content, with trailing space character if present.
spacy_tkn_attr_vocab	The vocab object of the parent Doc.
spacy_tkn_attr_whitespace_	Trailing space character if present.
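
To show how the two groups of options could work together, the sketch below first filters tokens with a few `spacy_ignore_*` flags and then collects only the attributes whose `spacy_tkn_attr_*` flag is enabled. The helpers `keep_token` and `selected_attributes` are hypothetical, and `SimpleNamespace` objects stand in for real spaCy tokens so that the example runs without a language model. This is an illustration under those assumptions, not DCR-CORE's tokenizer code.

```python
from types import SimpleNamespace

ATTR_PREFIX = "spacy_tkn_attr_"

# Subset of ignore flags mapped to the token attribute they test (illustrative).
IGNORE_CHECKS = {
    "spacy_ignore_bracket": "is_bracket",
    "spacy_ignore_punct": "is_punct",
    "spacy_ignore_space": "is_space",
    "spacy_ignore_stop": "is_stop",
}


def keep_token(token, config):
    """Return False if any enabled spacy_ignore_* flag matches the token."""
    return not any(
        config.get(flag, False) and getattr(token, attr)
        for flag, attr in IGNORE_CHECKS.items()
    )


def selected_attributes(token, config):
    """Collect the attributes whose spacy_tkn_attr_* flag is set to True."""
    names = [
        key[len(ATTR_PREFIX):]
        for key, enabled in config.items()
        if key.startswith(ATTR_PREFIX) and enabled
    ]
    return {name: getattr(token, name) for name in names}


def make_token(text, lemma, **flags):
    """Build a stand-in object carrying a few spaCy-style token attributes."""
    defaults = dict(is_bracket=False, is_punct=False, is_space=False, is_stop=False)
    defaults.update(flags)
    return SimpleNamespace(text=text, lemma_=lemma, **defaults)


config = {
    "spacy_ignore_punct": False,
    "spacy_ignore_stop": True,
    "spacy_tkn_attr_is_stop": False,
    "spacy_tkn_attr_lemma_": True,
    "spacy_tkn_attr_text": True,
}

tokens = [
    make_token("The", "the", is_stop=True),
    make_token("documents", "document"),
    make_token(".", ".", is_punct=True),
]

rows = [selected_attributes(t, config) for t in tokens if keep_token(t, config)]
print(rows)
```

Because the attribute names in the configuration keys match the spaCy attribute names, a plain `getattr` lookup is sufficient in this sketch.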

More information about the spaCy token attributes can be found in the spaCy `Token` API documentation. DCR-CORE currently supports only a subset of the possible attributes, but this can easily be extended if required.

Detailed information about the universal POS tags can be found in the Universal Dependencies documentation.
