DCR-CORE - Application - Configuration
1. logging_cfg.yaml
This file controls the logging behaviour of the application.
Default content:
version: 1
formatters:
  simple:
    format: "%(asctime)s [%(module)s.py  ] %(levelname)-5s %(funcName)s:%(lineno)d %(message)s"
  extended:
    format: "%(asctime)s [%(module)s.py  ] %(levelname)-5s %(funcName)s:%(lineno)d \n%(message)s"
handlers:
  console:
    class: logging.StreamHandler
    level: INFO
    formatter: simple
  file_handler:
    class: logging.FileHandler
    level: INFO
    filename: logging_dcr_core.log
    formatter: extended
loggers:
  dcr_core:
    handlers: [ console ]
root:
  handlers: [ file_handler ]
2. setup.cfg
This file controls the behaviour of the DCR-CORE application. 
The customisable entries are:
[dcr_core]
create_extra_file_heading = true
create_extra_file_list_bullet = true
create_extra_file_list_number = true
create_extra_file_table = true
delete_auxiliary_files = true
directory_inbox = data/inbox_prod
json_indent = 4
json_sort_keys = false
lt_export_rule_file_heading = data/lt_export_rule_heading.json
lt_export_rule_file_list_bullet = data/lt_export_rule_list_bullet.json
lt_export_rule_file_list_number = data/lt_export_rule_list_number.json
lt_footer_max_distance = 3
lt_footer_max_lines = 3
lt_header_max_distance = 3
lt_header_max_lines = 3
lt_heading_file_incl_no_ctx = 1
lt_heading_file_incl_regexp = false
lt_heading_max_level = 3
lt_heading_min_pages = 2
lt_heading_rule_file = none
lt_heading_tolerance_llx = 10
lt_list_bullet_min_entries = 2
lt_list_bullet_rule_file = none
lt_list_bullet_tolerance_llx = 10
lt_list_number_file_incl_regexp = false
lt_list_number_min_entries = 2
lt_list_number_rule_file = none
lt_list_number_tolerance_llx = 10
lt_table_file_incl_empty_columns = true
lt_toc_last_page = 5
lt_toc_min_entries = 5
pdf2image_type = jpeg
tesseract_timeout = 30
tetml_page = false
tetml_word = false
tokenize_2_database = true
tokenize_2_jsonfile = true
verbose = true
verbose_lt_header_footer = false
verbose_lt_heading = false
verbose_lt_list_bullet = false
verbose_lt_list_number = false
verbose_lt_table = false
verbose_lt_toc = false
verbose_parser = none
| Parameter | Description | 
|---|---|
| create_extra_file_heading | Create a separate JSON file with the table of contents. | 
| create_extra_file_list_bullet | Create a separate JSON file with the bulleted lists. | 
| create_extra_file_list_number | Create a separate JSON file with the numbered lists. | 
| create_extra_file_table | Create a separate JSON file with the tables. | 
| delete_auxiliary_files | Delete the auxiliary files after a successful  processing step.  | 
| directory_inbox | Directory for the new documents received. | 
| json_indent | Improves the readability of the JSON file. | 
| json_sort_keys | If it is set to true, the keys are set in ascending order else, they appear as in the Python object.  | 
| lt_export_rule_file_heading | File name for the export of the heading rules. | 
| lt_export_rule_file_list_bullet | File name for the export of the bulleted list rules. | 
| lt_export_rule_file_list_number | File name for the export of the numbered list rules. | 
| lt_footer_max_distance | Maximum Levenshtein distance for a footer line. | 
| lt_footer_max_lines | Maximum number of footers. | 
| lt_header_max_distance | Maximum Levenshtein distance for a header line. | 
| lt_header_max_lines | Maximum number of headers. | 
| lt_heading_file_incl_no_ctx | The number of lines following the heading to be included as context into the JSON file. | 
| lt_heading_file_incl_regexp | If it is set to true, the regular expression for the heading is included in the JSON file. | 
| lt_heading_max_level | Maximum level of the heading structure. | 
| lt_heading_min_pages | Minimum number of pages to determine the headings. | 
| lt_heading_rule_file | File with rules to determine the headings. | 
| lt_heading_tolerance_llx | Tolerance of vertical indentation in percent. | 
| lt_list_bullet_min_entries | Minimum number of entries to determine a bulleted list. | 
| lt_list_bullet_rule_file | File with rules to determine the bulleted lists. | 
| lt_list_bullet_tolerance_llx | Tolerance of vertical indentation in percent. | 
| lt_list_number_file_incl_regexp | If it is set to true, the regular expression for the numbered list is included in the JSON file. | 
| lt_list_number_min_entries | Minimum number of entries to determine a numbered list. | 
| lt_list_number_rule_file | File with rules to determine the numbered lists. | 
| lt_list_number_tolerance_llx | Tolerance of vertical indentation in percent. | 
| lt_table_file_incl_empty_columns | If it is set to true, the the empty cells are included in the separate JSON file with the tables. | 
| lt_toc_last_page | Maximum number of pages for the search of the TOC (from the beginning). | 
| lt_toc_min_entries | Minimum number of TOC entries. | 
| pdfimage_type | Format of the image files for the scanned pdf document: jpeg or pdf. | 
| tesseract_timeout | Terminate the tesseract job after a  period of time (seconds).  | 
| tetml_page | PDFlib TET granularity 'page'. | 
| tetml_word | PDFlib TET granularity 'word'. | 
| tokenize_2_database | Store the tokens in the database table token. | 
| tokenize_2_jsonfile | Store the tokens in a JSON flat file. | 
| verbose | Display progress messages for processing. | 
| verbose_lt_headers_footers | Display progress messages for headers & footers line type determination. | 
| verbose_lt_heading | Display progress messages for heading line type determination. | 
| verbose_lt_list_bullet | Display progress messages for line type determination of a bulleted list. | 
| verbose_lt_list_number | Display progress messages for line type determination of a numbered list. | 
| verbose_lt_table | Display progress messages for table line type determination. | 
| verbose_lt_toc | Display progress messages for table of content line type determination. | 
| verbose_parser | Display progress messages for parsing xml (TETML) : all, none or text. | 
The configuration parameters can be set differently for the individual environments (dev, prod and test).
Examples:
[dcr_core.env.dev]
delete_auxiliary_files = false
directory_inbox = data/inbox_dev
lt_footer_max_lines = 3
lt_header_max_lines = 3
lt_heading_file_incl_no_ctx = 3
lt_heading_file_incl_regexp = true
lt_heading_tolerance_llx = 5
lt_list_bullet_tolerance_llx = 5
lt_list_number_file_incl_regexp = true
lt_list_number_tolerance_llx = 5
lt_table_file_incl_empty_columns = false
tetml_page = true
tetml_word = true
...
4. setup.cfg - spaCy Token Attributes
The tokens derived from the documents can be qualified via various attributes. The available options are described below.
[dcr_core.spacy]
spacy_ignore_bracket = false
spacy_ignore_left_punct = false
spacy_ignore_line_type_footer = false
spacy_ignore_line_type_header = false
spacy_ignore_line_type_heading = false
spacy_ignore_line_type_list_bullet = false
spacy_ignore_line_type_list_number = false
spacy_ignore_line_type_table = false
spacy_ignore_line_type_toc = false
spacy_ignore_punct = false
spacy_ignore_quote = false
spacy_ignore_right_punct = false
spacy_ignore_space = false
spacy_ignore_stop = false
spacy_tkn_attr_cluster = true
spacy_tkn_attr_dep_ = true
spacy_tkn_attr_doc = true
spacy_tkn_attr_ent_iob_ = true
spacy_tkn_attr_ent_kb_id_ = true
spacy_tkn_attr_ent_type_ = true
spacy_tkn_attr_head = true
spacy_tkn_attr_i = true
spacy_tkn_attr_idx = true
spacy_tkn_attr_is_alpha = true
spacy_tkn_attr_is_ascii = true
spacy_tkn_attr_is_bracket = true
spacy_tkn_attr_is_currency = true
spacy_tkn_attr_is_digit = true
spacy_tkn_attr_is_left_punct = true
spacy_tkn_attr_is_lower = true
spacy_tkn_attr_is_oov = true
spacy_tkn_attr_is_punct = true
spacy_tkn_attr_is_quote = true
spacy_tkn_attr_is_right_punct = true
spacy_tkn_attr_is_sent_end = true
spacy_tkn_attr_is_sent_start = true
spacy_tkn_attr_is_space = true
spacy_tkn_attr_is_stop = true
spacy_tkn_attr_is_title = true
spacy_tkn_attr_is_upper = true
spacy_tkn_attr_lang_ = true
spacy_tkn_attr_left_edge = true
spacy_tkn_attr_lemma_ = true
spacy_tkn_attr_lex = true
spacy_tkn_attr_lex_id = true
spacy_tkn_attr_like_email = true
spacy_tkn_attr_like_num = true
spacy_tkn_attr_like_url = true
spacy_tkn_attr_lower_ = true
spacy_tkn_attr_morph = true
spacy_tkn_attr_norm_ = true
spacy_tkn_attr_orth_ = true
spacy_tkn_attr_pos_ = true
spacy_tkn_attr_prefix_ = true
spacy_tkn_attr_prob = true
spacy_tkn_attr_rank = true
spacy_tkn_attr_right_edge = true
spacy_tkn_attr_sent = true
spacy_tkn_attr_sentiment = true
spacy_tkn_attr_shape_ = true
spacy_tkn_attr_suffix_ = true
spacy_tkn_attr_tag_ = true
spacy_tkn_attr_tensor = true
spacy_tkn_attr_text = true
spacy_tkn_attr_text_with_ws = true
spacy_tkn_attr_vocab = true
spacy_tkn_attr_whitespace_ = true
| Parameter | Description | 
|---|---|
| spacy_ignore_bracket | Ignore the tokens which are brackets ? | 
| spacy_ignore_left_punct | Ignore the tokens which are left punctuation marks, e.g. "(" ? | 
| spacy_ignore_line_type_footer | Ignore the tokens from line type footer ? | 
| spacy_ignore_line_type_header | Ignore the tokens from line type header ? | 
| spacy_ignore_line_type_heading | Ignore the tokens from line type heading ? | 
| spacy_ignore_line_type_list_bullet | Ignore the tokens from line type bulleted list ? | 
| spacy_ignore_line_type_list_number | Ignore the tokens from line type numbered list ? | 
| spacy_ignore_line_type_table | Ignore the tokens from line type table ? | 
| spacy_ignore_line_type_toc | Ignore the tokens from line type TOC ? | 
| spacy_ignore_punct | Ignore the tokens which are punctuations ? | 
| spacy_ignore_quote | Ignore the tokens which are quotation marks ? | 
| spacy_ignore_right_punct | Ignore the tokens which are right punctuation marks, e.g. ")" ? | 
| spacy_ignore_space | Ignore the tokens which consist of whitespace characters ? | 
| spacy_ignore_stop | Ignore the tokens which are part of a “stop list” ? | 
| spacy_tkn_attr_cluster | Brown cluster ID. | 
| spacy_tkn_attr_dep_ | Syntactic dependency relation. | 
| spacy_tkn_attr_doc | The parent document. | 
| spacy_tkn_attr_ent_iob_ | IOB code of named entity tag. | 
| spacy_tkn_attr_ent_kb_id_ | Knowledge base ID that refers to the named entity  this token is a part of, if any.  | 
| spacy_tkn_attr_ent_type_ | Named entity type. | 
| spacy_tkn_attr_head | The syntactic parent, or “governor”, of this token. | 
| spacy_tkn_attr_i | The index of the token within the parent document. | 
| spacy_tkn_attr_idx | The character offset of the token within the parent document. | 
| spacy_tkn_attr_is_alpha | Does the token consist of alphabetic characters? | 
| spacy_tkn_attr_is_ascii | Does the token consist of ASCII characters?  Equivalent to all (ord(c) < 128 for c in token.text).  | 
| spacy_tkn_attr_is_bracket | Is the token a bracket? | 
| spacy_tkn_attr_is_currency | Is the token a currency symbol? | 
| spacy_tkn_attr_is_digit | Does the token consist of digits? | 
| spacy_tkn_attr_is_left_punct | Is the token a left punctuation mark, e.g. "(" ? | 
| spacy_tkn_attr_is_lower | Is the token in lowercase? Equivalent to token.text.islower(). | 
| spacy_tkn_attr_is_oov | Is the token out-of-vocabulary? | 
| spacy_tkn_attr_is_punct | Is the token punctuation? | 
| spacy_tkn_attr_is_quote | Is the token a quotation mark? | 
| spacy_tkn_attr_is_right_punct | Is the token a right punctuation mark, e.g. ")" ? | 
| spacy_tkn_attr_is_sent_end | Does the token end a sentence? | 
| spacy_tkn_attr_is_sent_start | Does the token start a sentence? | 
| spacy_tkn_attr_is_space | Does the token consist of whitespace characters?  Equivalent to token.text.isspace().  | 
| spacy_tkn_attr_is_stop | Is the token part of a “stop list”? | 
| spacy_tkn_attr_is_title | Is the token in titlecase? | 
| spacy_tkn_attr_is_upper | Is the token in uppercase? Equivalent to token.text.isupper(). | 
| spacy_tkn_attr_lang_ | Language of the parent document’s vocabulary. | 
| spacy_tkn_attr_left_edge | The leftmost token of this token’s syntactic descendants. | 
| spacy_tkn_attr_lemma_ | Base form of the token, with no inflectional suffixes. | 
| spacy_tkn_attr_lex | The underlying lexeme. | 
| spacy_tkn_attr_lex_id | Sequential ID of the token’s lexical type, used to index into tables, e.g. for word vectors. | 
| spacy_tkn_attr_like_email | Does the token resemble an email address? | 
| spacy_tkn_attr_like_num | Does the token represent a number? | 
| spacy_tkn_attr_like_url | Does the token resemble a URL? | 
| spacy_tkn_attr_lower_ | Lowercase form of the token text. Equivalent to Token.text.lower(). | 
| spacy_tkn_attr_morph | Morphological analysis. | 
| spacy_tkn_attr_norm_ | The token’s norm, i.e. a normalized form of the token text. | 
| spacy_tkn_attr_orth_ | Verbatim text content (identical to Token.text).  Exists mostly for consistency with the other attributes.  | 
| spacy_tkn_attr_pos_ | Coarse-grained part-of-speech from the Universal POS tag set. | 
| spacy_tkn_attr_prefix_ | A length-N substring from the start of the token.  Defaults to N=1.  | 
| spacy_tkn_attr_prob | Smoothed log probability estimate of token’s word type  (context-independent entry in the vocabulary).  | 
| spacy_tkn_attr_rank | Sequential ID of the token’s lexical type, used to index  into tables, e.g. for word vectors.  | 
| spacy_tkn_attr_right_edge | The rightmost token of this token’s syntactic descendants. | 
| spacy_tkn_attr_sent | The sentence span that this token is a part of. | 
| spacy_tkn_attr_sentiment | A scalar value indicating the positivity or negativity of the token. | 
| spacy_tkn_attr_shape_ | Transform of the token’s string to show orthographic features. | 
| spacy_tkn_attr_suffix_ | Length-N substring from the end of the token. Defaults to N=3. | 
| spacy_tkn_attr_tag_ | Fine-grained part-of-speech. | 
| spacy_tkn_attr_tensor | The token’s slice of the parent Doc’s tensor. | 
| spacy_tkn_attr_text | Verbatim text content. | 
| spacy_tkn_attr_text_with_ws | Text content, with trailing space character if present. | 
| spacy_tkn_attr_vocab | The vocab object of the parent Doc. | 
| spacy_tkn_attr_whitespace_ | Trailing space character if present. | 
More information about the spaCy token attributes can be found here.
DCR-CORE currently supports only a subset of the possible attributes, but this can easily be extended if required.
Detailed information about the universal POS tags can be found here.