DCR-CORE - Application - Configuration

1. logging_cfg.yaml

This file controls the logging behaviour of the application.

Default content:

version: 1

formatters:
  simple:
    format: "%(asctime)s [%(module)s.py  ] %(levelname)-5s %(funcName)s:%(lineno)d %(message)s"
  extended:
    format: "%(asctime)s [%(module)s.py  ] %(levelname)-5s %(funcName)s:%(lineno)d \n%(message)s"

handlers:
  console:
    class: logging.StreamHandler
    level: INFO
    formatter: simple

  file_handler:
    class: logging.FileHandler
    level: INFO
    filename: logging_dcr_core.log
    formatter: extended

loggers:
  dcr_core:
    handlers: [ console ]
root:
  handlers: [ file_handler ]
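
The default configuration can be tried out without a YAML parser by writing it as a Python dictionary and passing it to `logging.config.dictConfig`. The following is a minimal sketch and not DCR-CORE's actual start-up code: the helper name `build_logging_config` and the temporary log directory are assumptions made for this example. Neither logger in the default content sets a level, so Python's default root level `WARNING` applies unless it is changed elsewhere, which is why the sketch logs a warning.

```python
import logging
import logging.config
import os
import tempfile


def build_logging_config(log_file: str) -> dict:
    """Return a dictionary equivalent to the default logging_cfg.yaml (hypothetical helper)."""
    prefix = "%(asctime)s [%(module)s.py  ] %(levelname)-5s %(funcName)s:%(lineno)d "
    return {
        "version": 1,
        "formatters": {
            "simple": {"format": prefix + "%(message)s"},
            # The extended format puts the message on its own line.
            "extended": {"format": prefix + "\n%(message)s"},
        },
        "handlers": {
            "console": {
                "class": "logging.StreamHandler",
                "level": "INFO",
                "formatter": "simple",
            },
            "file_handler": {
                "class": "logging.FileHandler",
                "level": "INFO",
                "filename": log_file,
                "formatter": "extended",
            },
        },
        "loggers": {"dcr_core": {"handlers": ["console"]}},
        "root": {"handlers": ["file_handler"]},
    }


# Assumption: write the log into a temporary directory instead of the working directory.
log_dir = tempfile.mkdtemp()
log_file = os.path.join(log_dir, "logging_dcr_core.log")
logging.config.dictConfig(build_logging_config(log_file))

logger = logging.getLogger("dcr_core")
# Handled by the console handler and, through propagation to root, by the file handler.
logger.warning("inbox directory is empty")

logging.shutdown()  # flush and close all handlers before reading the file
with open(log_file, encoding="utf-8") as handle:
    log_text = handle.read()
```

If PyYAML is installed, the same dictionary could instead be obtained by reading the file with `yaml.safe_load`.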

2. setup.cfg

This file controls the behaviour of the DCR-CORE application.

The customisable entries are:

[dcr_core]
create_extra_file_heading = true
create_extra_file_list_bullet = true
create_extra_file_list_number = true
create_extra_file_table = true
delete_auxiliary_files = true
directory_inbox = data/inbox_prod
json_indent = 4
json_sort_keys = false
lt_export_rule_file_heading = data/lt_export_rule_heading.json
lt_export_rule_file_list_bullet = data/lt_export_rule_list_bullet.json
lt_export_rule_file_list_number = data/lt_export_rule_list_number.json
lt_footer_max_distance = 3
lt_footer_max_lines = 3
lt_header_max_distance = 3
lt_header_max_lines = 3
lt_heading_file_incl_no_ctx = 1
lt_heading_file_incl_regexp = false
lt_heading_max_level = 3
lt_heading_min_pages = 2
lt_heading_rule_file = none
lt_heading_tolerance_llx = 10
lt_list_bullet_min_entries = 2
lt_list_bullet_rule_file = none
lt_list_bullet_tolerance_llx = 10
lt_list_number_file_incl_regexp = false
lt_list_number_min_entries = 2
lt_list_number_rule_file = none
lt_list_number_tolerance_llx = 10
lt_table_file_incl_empty_columns = true
lt_toc_last_page = 5
lt_toc_min_entries = 5
pdf2image_type = jpeg
tesseract_timeout = 30
tetml_page = false
tetml_word = false
tokenize_2_database = true
tokenize_2_jsonfile = true
verbose = true
verbose_lt_header_footer = false
verbose_lt_heading = false
verbose_lt_list_bullet = false
verbose_lt_list_number = false
verbose_lt_table = false
verbose_lt_toc = false
verbose_parser = none
Parameter Description
create_extra_file_heading Create a separate JSON file with the headings.
create_extra_file_list_bullet Create a separate JSON file with the bulleted lists.
create_extra_file_list_number Create a separate JSON file with the numbered lists.
create_extra_file_table Create a separate JSON file with the tables.
delete_auxiliary_files Delete the auxiliary files after a successful processing step.
directory_inbox Directory for the new documents received.
json_indent Number of spaces used to indent each nesting level of the JSON file, which improves its readability.
json_sort_keys If it is set to true, the keys are sorted in ascending order; otherwise they appear in the same order as in the Python object.
lt_export_rule_file_heading File name for the export of the heading rules.
lt_export_rule_file_list_bullet File name for the export of the bulleted list rules.
lt_export_rule_file_list_number File name for the export of the numbered list rules.
lt_footer_max_distance Maximum Levenshtein distance for a footer line.
lt_footer_max_lines Maximum number of footer lines.
lt_header_max_distance Maximum Levenshtein distance for a header line.
lt_header_max_lines Maximum number of header lines.
lt_heading_file_incl_no_ctx The number of lines following the heading to be included as context in the JSON file.
lt_heading_file_incl_regexp If it is set to true, the regular expression for the heading is included in the JSON file.
lt_heading_max_level Maximum level of the heading structure.
lt_heading_min_pages Minimum number of pages to determine the headings.
lt_heading_rule_file File with rules to determine the headings.
lt_heading_tolerance_llx Tolerance of vertical indentation in percent.
lt_list_bullet_min_entries Minimum number of entries to determine a bulleted list.
lt_list_bullet_rule_file File with rules to determine the bulleted lists.
lt_list_bullet_tolerance_llx Tolerance of vertical indentation in percent.
lt_list_number_file_incl_regexp If it is set to true, the regular expression for the numbered list is included in the JSON file.
lt_list_number_min_entries Minimum number of entries to determine a numbered list.
lt_list_number_rule_file File with rules to determine the numbered lists.
lt_list_number_tolerance_llx Tolerance of vertical indentation in percent.
lt_table_file_incl_empty_columns If it is set to true, the empty cells are included in the separate JSON file with the tables.
lt_toc_last_page Maximum number of pages for the search of the TOC (from the beginning).
lt_toc_min_entries Minimum number of TOC entries.
pdf2image_type Format of the image files for the scanned PDF document: jpeg or png.
tesseract_timeout Terminate the Tesseract job after a period of time (seconds).
tetml_page PDFlib TET granularity 'page'.
tetml_word PDFlib TET granularity 'word'.
tokenize_2_database Store the tokens in the database table token.
tokenize_2_jsonfile Store the tokens in a JSON flat file.
verbose Display progress messages for processing.
verbose_lt_header_footer Display progress messages for header and footer line type determination.
verbose_lt_heading Display progress messages for heading line type determination.
verbose_lt_list_bullet Display progress messages for line type determination of a bulleted list.
verbose_lt_list_number Display progress messages for line type determination of a numbered list.
verbose_lt_table Display progress messages for table line type determination.
verbose_lt_toc Display progress messages for table of contents line type determination.
verbose_parser Display progress messages for parsing XML (TETML): all, none or text.
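
The parameters lt_footer_max_distance and lt_header_max_distance are expressed as a Levenshtein distance, i.e. the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below illustrates the measure only. The function names and the way two lines are compared are assumptions made for this example and do not describe DCR-CORE's internal algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def within_max_distance(line_1: str, line_2: str, max_distance: int = 3) -> bool:
    """Hypothetical check: are two lines similar enough to count as the same footer or header?"""
    return levenshtein(line_1, line_2) <= max_distance


# Two footers that differ only in the page number stay within the limit.
similar = within_max_distance("Page 9 of 120", "Page 10 of 120")
# Two unrelated lines exceed it.
different = within_max_distance("Chapter 1 Introduction", "Chapter 2 Background")
```

Under this reading, a larger value tolerates more variation between lines, for example changing page numbers or dates.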

The configuration parameters can be set differently for the individual environments (dev, prod and test).

Example for the dev environment:

[dcr_core.env.dev]
delete_auxiliary_files = false
directory_inbox = data/inbox_dev
lt_footer_max_lines = 3
lt_header_max_lines = 3
lt_heading_file_incl_no_ctx = 3
lt_heading_file_incl_regexp = true
lt_heading_tolerance_llx = 5
lt_list_bullet_tolerance_llx = 5
lt_list_number_file_incl_regexp = true
lt_list_number_tolerance_llx = 5
lt_table_file_incl_empty_columns = false
tetml_page = true
tetml_word = true
...
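
The relationship between the base section and an environment section can be pictured with Python's standard configparser module. This is a minimal sketch under stated assumptions: the function names `load_settings` and `as_bool` are hypothetical, the configuration text is an abbreviated excerpt of the entries shown above, and the rule that environment values replace the base values is an assumption about the intended behaviour rather than a description of DCR-CORE's code.

```python
import configparser

# Abbreviated excerpt of setup.cfg, embedded to keep the sketch self-contained.
SETUP_CFG = """
[dcr_core]
delete_auxiliary_files = true
directory_inbox = data/inbox_prod
lt_heading_file_incl_no_ctx = 1
lt_heading_tolerance_llx = 10
verbose_parser = none

[dcr_core.env.dev]
delete_auxiliary_files = false
directory_inbox = data/inbox_dev
lt_heading_file_incl_no_ctx = 3
lt_heading_tolerance_llx = 5
"""


def load_settings(text: str, environment: str) -> dict:
    """Return the [dcr_core] entries, overlaid with [dcr_core.env.<environment>] if present."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    settings = dict(parser["dcr_core"])
    env_section = f"dcr_core.env.{environment}"
    if parser.has_section(env_section):
        # Assumption: environment-specific values take precedence over the base values.
        settings.update(parser[env_section])
    return settings


def as_bool(value: str) -> bool:
    """Interpret the textual values 'true' and 'false' used in setup.cfg."""
    return value.strip().lower() == "true"


dev = load_settings(SETUP_CFG, "dev")
prod = load_settings(SETUP_CFG, "prod")  # no such section here, so base values apply
```

All values are read as strings, so numeric entries such as lt_heading_tolerance_llx need an explicit conversion, e.g. `int(dev["lt_heading_tolerance_llx"])`.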

4. setup.cfg - spaCy Token Attributes

The tokens derived from the documents can be qualified via various attributes. The available options are described below.

[dcr_core.spacy]
spacy_ignore_bracket = false
spacy_ignore_left_punct = false
spacy_ignore_line_type_footer = false
spacy_ignore_line_type_header = false
spacy_ignore_line_type_heading = false
spacy_ignore_line_type_list_bullet = false
spacy_ignore_line_type_list_number = false
spacy_ignore_line_type_table = false
spacy_ignore_line_type_toc = false
spacy_ignore_punct = false
spacy_ignore_quote = false
spacy_ignore_right_punct = false
spacy_ignore_space = false
spacy_ignore_stop = false
spacy_tkn_attr_cluster = true
spacy_tkn_attr_dep_ = true
spacy_tkn_attr_doc = true
spacy_tkn_attr_ent_iob_ = true
spacy_tkn_attr_ent_kb_id_ = true
spacy_tkn_attr_ent_type_ = true
spacy_tkn_attr_head = true
spacy_tkn_attr_i = true
spacy_tkn_attr_idx = true
spacy_tkn_attr_is_alpha = true
spacy_tkn_attr_is_ascii = true
spacy_tkn_attr_is_bracket = true
spacy_tkn_attr_is_currency = true
spacy_tkn_attr_is_digit = true
spacy_tkn_attr_is_left_punct = true
spacy_tkn_attr_is_lower = true
spacy_tkn_attr_is_oov = true
spacy_tkn_attr_is_punct = true
spacy_tkn_attr_is_quote = true
spacy_tkn_attr_is_right_punct = true
spacy_tkn_attr_is_sent_end = true
spacy_tkn_attr_is_sent_start = true
spacy_tkn_attr_is_space = true
spacy_tkn_attr_is_stop = true
spacy_tkn_attr_is_title = true
spacy_tkn_attr_is_upper = true
spacy_tkn_attr_lang_ = true
spacy_tkn_attr_left_edge = true
spacy_tkn_attr_lemma_ = true
spacy_tkn_attr_lex = true
spacy_tkn_attr_lex_id = true
spacy_tkn_attr_like_email = true
spacy_tkn_attr_like_num = true
spacy_tkn_attr_like_url = true
spacy_tkn_attr_lower_ = true
spacy_tkn_attr_morph = true
spacy_tkn_attr_norm_ = true
spacy_tkn_attr_orth_ = true
spacy_tkn_attr_pos_ = true
spacy_tkn_attr_prefix_ = true
spacy_tkn_attr_prob = true
spacy_tkn_attr_rank = true
spacy_tkn_attr_right_edge = true
spacy_tkn_attr_sent = true
spacy_tkn_attr_sentiment = true
spacy_tkn_attr_shape_ = true
spacy_tkn_attr_suffix_ = true
spacy_tkn_attr_tag_ = true
spacy_tkn_attr_tensor = true
spacy_tkn_attr_text = true
spacy_tkn_attr_text_with_ws = true
spacy_tkn_attr_vocab = true
spacy_tkn_attr_whitespace_ = true
Parameter Description
spacy_ignore_bracket Ignore the tokens which are brackets?
spacy_ignore_left_punct Ignore the tokens which are left punctuation marks, e.g. "("?
spacy_ignore_line_type_footer Ignore the tokens from line type footer?
spacy_ignore_line_type_header Ignore the tokens from line type header?
spacy_ignore_line_type_heading Ignore the tokens from line type heading?
spacy_ignore_line_type_list_bullet Ignore the tokens from line type bulleted list?
spacy_ignore_line_type_list_number Ignore the tokens from line type numbered list?
spacy_ignore_line_type_table Ignore the tokens from line type table?
spacy_ignore_line_type_toc Ignore the tokens from line type TOC?
spacy_ignore_punct Ignore the tokens which are punctuation marks?
spacy_ignore_quote Ignore the tokens which are quotation marks?
spacy_ignore_right_punct Ignore the tokens which are right punctuation marks, e.g. ")"?
spacy_ignore_space Ignore the tokens which consist of whitespace characters?
spacy_ignore_stop Ignore the tokens which are part of a “stop list”?
spacy_tkn_attr_cluster Brown cluster ID.
spacy_tkn_attr_dep_ Syntactic dependency relation.
spacy_tkn_attr_doc The parent document.
spacy_tkn_attr_ent_iob_ IOB code of named entity tag.
spacy_tkn_attr_ent_kb_id_ Knowledge base ID that refers to the named entity this token is a part of, if any.
spacy_tkn_attr_ent_type_ Named entity type.
spacy_tkn_attr_head The syntactic parent, or “governor”, of this token.
spacy_tkn_attr_i The index of the token within the parent document.
spacy_tkn_attr_idx The character offset of the token within the parent document.
spacy_tkn_attr_is_alpha Does the token consist of alphabetic characters?
spacy_tkn_attr_is_ascii Does the token consist of ASCII characters? Equivalent to all(ord(c) < 128 for c in token.text).
spacy_tkn_attr_is_bracket Is the token a bracket?
spacy_tkn_attr_is_currency Is the token a currency symbol?
spacy_tkn_attr_is_digit Does the token consist of digits?
spacy_tkn_attr_is_left_punct Is the token a left punctuation mark, e.g. "("?
spacy_tkn_attr_is_lower Is the token in lowercase? Equivalent to token.text.islower().
spacy_tkn_attr_is_oov Is the token out-of-vocabulary?
spacy_tkn_attr_is_punct Is the token punctuation?
spacy_tkn_attr_is_quote Is the token a quotation mark?
spacy_tkn_attr_is_right_punct Is the token a right punctuation mark, e.g. ")"?
spacy_tkn_attr_is_sent_end Does the token end a sentence?
spacy_tkn_attr_is_sent_start Does the token start a sentence?
spacy_tkn_attr_is_space Does the token consist of whitespace characters? Equivalent to token.text.isspace().
spacy_tkn_attr_is_stop Is the token part of a “stop list”?
spacy_tkn_attr_is_title Is the token in titlecase?
spacy_tkn_attr_is_upper Is the token in uppercase? Equivalent to token.text.isupper().
spacy_tkn_attr_lang_ Language of the parent document’s vocabulary.
spacy_tkn_attr_left_edge The leftmost token of this token’s syntactic descendants.
spacy_tkn_attr_lemma_ Base form of the token, with no inflectional suffixes.
spacy_tkn_attr_lex The underlying lexeme.
spacy_tkn_attr_lex_id Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors.
spacy_tkn_attr_like_email Does the token resemble an email address?
spacy_tkn_attr_like_num Does the token represent a number?
spacy_tkn_attr_like_url Does the token resemble a URL?
spacy_tkn_attr_lower_ Lowercase form of the token text. Equivalent to Token.text.lower().
spacy_tkn_attr_morph Morphological analysis.
spacy_tkn_attr_norm_ The token’s norm, i.e. a normalized form of the token text.
spacy_tkn_attr_orth_ Verbatim text content (identical to Token.text). Exists mostly for consistency with the other attributes.
spacy_tkn_attr_pos_ Coarse-grained part-of-speech from the Universal POS tag set.
spacy_tkn_attr_prefix_ A length-N substring from the start of the token. Defaults to N=1.
spacy_tkn_attr_prob Smoothed log probability estimate of token’s word type (context-independent entry in the vocabulary).
spacy_tkn_attr_rank Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors.
spacy_tkn_attr_right_edge The rightmost token of this token’s syntactic descendants.
spacy_tkn_attr_sent The sentence span that this token is a part of.
spacy_tkn_attr_sentiment A scalar value indicating the positivity or negativity of the token.
spacy_tkn_attr_shape_ Transform of the token’s string to show orthographic features.
spacy_tkn_attr_suffix_ Length-N substring from the end of the token. Defaults to N=3.
spacy_tkn_attr_tag_ Fine-grained part-of-speech.
spacy_tkn_attr_tensor The token’s slice of the parent Doc’s tensor.
spacy_tkn_attr_text Verbatim text content.
spacy_tkn_attr_text_with_ws Text content, with trailing space character if present.
spacy_tkn_attr_vocab The vocab object of the parent Doc.
spacy_tkn_attr_whitespace_ Trailing space character if present.
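
How the two groups of switches could interact is sketched below: the spacy_ignore_* entries filter tokens out, and the spacy_tkn_attr_* entries select which attributes of the remaining tokens are kept. This is a hedged illustration only. The function names `selected_attributes` and `export_tokens` are hypothetical, only two of the ignore switches are handled, and simple stand-in objects replace real spaCy tokens so that the example runs without spaCy installed.

```python
from types import SimpleNamespace

ATTR_PREFIX = "spacy_tkn_attr_"


def selected_attributes(flags: dict) -> list:
    """Return the token attribute names whose spacy_tkn_attr_* switch is true."""
    return [
        key[len(ATTR_PREFIX):]
        for key, enabled in flags.items()
        if key.startswith(ATTR_PREFIX) and enabled
    ]


def export_tokens(tokens, flags: dict) -> list:
    """Drop ignored tokens and keep only the selected attributes of the rest."""
    names = selected_attributes(flags)
    rows = []
    for token in tokens:
        if flags.get("spacy_ignore_punct") and token.is_punct:
            continue
        if flags.get("spacy_ignore_stop") and token.is_stop:
            continue
        rows.append({name: getattr(token, name) for name in names})
    return rows


# Stand-in tokens carrying a few of the attributes a spaCy Token provides.
tokens = [
    SimpleNamespace(text="The", lemma_="the", is_punct=False, is_stop=True),
    SimpleNamespace(text="documents", lemma_="document", is_punct=False, is_stop=False),
    SimpleNamespace(text=".", lemma_=".", is_punct=True, is_stop=False),
]

flags = {
    "spacy_ignore_punct": True,
    "spacy_ignore_stop": False,
    "spacy_tkn_attr_text": True,
    "spacy_tkn_attr_lemma_": True,
    "spacy_tkn_attr_is_stop": False,
}

rows = export_tokens(tokens, flags)
```

With these settings the punctuation token is dropped and each remaining row contains only `text` and `lemma_`.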

More information about the spaCy token attributes can be found in the Token section of the spaCy API documentation. DCR-CORE currently supports only a subset of the possible attributes, but this can easily be extended if required.

Detailed information about the universal POS tags can be found in the Universal Dependencies documentation.