DCR-CORE - Development - Line Type Algorithms

At the granularity of the document line, an attempt is made to classify each individual line. The possible line types are:

line type   meaning
b           non-classifiable line, i.e. normal text body line
f           footer line
h           header line
h_9         level 9 heading line
lb          line of a bulleted list
ln          line of a numbered list
tab         line of a table
toc         line of a table of contents

The following three rule-based algorithms are used to determine the line type in the order given:

  1. headers & footers: The headers and footers are determined by a similarity comparison of the first lt_header_max_lines and the last lt_footer_max_lines lines of each page, respectively.

  2. close together: A table of contents must be within the first lt_toc_last_page pages and consists of either a list or a table with ascending page numbers. Tables have already been marked accordingly by PDFlib TET. The elements of bulleted or numbered lists must be close together and are determined by regular expressions.

  3. headings: Headings extend across the entire document and can have hierarchical structures. The headings are determined with rule-enriched regular expressions.

1 Headers & Footers

The following parameter controls both the classification of the headers and the footers:

  • verbose_lt_header_footer

Default value: false - the verbose mode is an option that provides additional details as to what the processing algorithm is doing.

1.1 Footers

1.1.1 Parameters

The following parameters control the classification of the footers:

  • lt_footer_max_distance

Default value: 3 - the degree of similarity of lines is determined by means of the Levenshtein distance. The value zero stands for identical lines. The larger the Levenshtein distance, the more the lines differ. If the footer lines do not contain a page number, then the parameter should be set to 0.

  • lt_footer_max_lines

Default value: 3 - the number of lines from the bottom of the page to be analyzed as possible candidates for footers. With the value zero the classification of footers is prevented.

  • spacy_ignore_line_type_footer

Default value: true - determines whether the lines of this type are ignored (true) or not (false) during tokenization.

1.1.2 Algorithm

  1. On all pages, the last line n, the line n-1, etc. are compared up to the maximum specified line.
  2. The Levenshtein distance is determined for each pair of lines in the specified range for each current page and the previous page.
  3. The line is considered a footer if, except for pages 1 and 2 and pages n-1 and n, the Levenshtein distance is not greater than the specified maximum value.
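
The three steps above can be sketched as follows. This is a minimal illustration, not the DCR-CORE implementation: the function names levenshtein and footer_positions are hypothetical, pages are assumed to be plain lists of text lines, and the exception for pages 1, 2, n-1 and n is interpreted here as "the first and the last page pair do not have to match".

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution (0 if equal)
                )
            )
        prev = curr
    return prev[-1]


def footer_positions(
    pages: list[list[str]], max_lines: int = 3, max_distance: int = 3
) -> list[int]:
    """Return the bottom-up line positions (1 = last line) that look like footers.

    max_lines and max_distance play the role of lt_footer_max_lines and
    lt_footer_max_distance.
    """
    if max_lines == 0 or len(pages) < 2:
        return []
    # Pairs of (previous page, current page).
    pairs = list(zip(pages, pages[1:]))
    if len(pairs) > 2:
        # Assumption: the pairs (1, 2) and (n-1, n) are exempt from the check.
        pairs = pairs[1:-1]
    positions = []
    for pos in range(1, max_lines + 1):
        if all(
            len(prev) >= pos
            and len(curr) >= pos
            and levenshtein(prev[-pos], curr[-pos]) <= max_distance
            for prev, curr in pairs
        ):
            positions.append(pos)
    return positions
```

The same comparison, applied to the first lines of each page instead of the last ones, would cover the header case.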

1.2 Headers

1.2.1 Parameters

The following parameters control the classification of the headers:

  • lt_header_max_distance

Default value: 3 - the degree of similarity of lines is determined by means of the Levenshtein distance. The value zero stands for identical lines. The larger the Levenshtein distance, the more the lines differ. If the header lines contain a page number, then depending on the number of pages in the document, the following values are useful:

document pages   Levenshtein distance
< 10             1
< 100            2
< 1000           3

  • lt_header_max_lines

Default value: 3 - the number of lines from the top of the page to be analyzed as possible candidates for headers. A value of zero prevents the classification of headers.

  • spacy_ignore_line_type_header

Default value: true - determines whether the lines of this type are ignored (true) or not (false) during tokenization.

1.2.2 Algorithm

  1. On all pages, the first line, the second line, etc. are compared up to the maximum specified line.
  2. The Levenshtein distance is determined for each pair of lines in the specified range for each current page and the previous page.
  3. The line is considered a header if, except for pages 1 and 2 and pages n-1 and n, the Levenshtein distance is not greater than the specified maximum value.

2 TOC (Table of Contents)

An attempt is made here to recognise a table of contents contained in the document. There are two main reasons for this:

  1. the resulting tokens can optionally be ignored afterwards, and
  2. the table of contents may be in the form of a table, which must then not be processed as a table in the sense of 4.3.

2.1 Parameters

The following parameters control the classification of a table of contents included in the document:

  • lt_toc_last_page

Default value: 3 - sets the number of pages that will be searched for a table of contents from the beginning of the document. A value of zero prevents the search for a table of contents.

  • lt_toc_min_entries

Default value: 3 - defines the minimum number of entries that a table of contents must contain.

  • spacy_ignore_line_type_toc

Default value: true - determines whether the lines of this type are ignored (true) or not (false) during tokenization.

  • verbose_lt_toc

Default value: false - the verbose mode is an option that provides additional details as to what the processing algorithm is doing.

2.2 Algorithm Table-based

A table with the following properties is searched for:

  • except for the first row, the last column of the table must contain an integer greater than zero,
  • the numbers found there must be ascending, and
  • no number may be greater than the last page number of the document.

If such a table is found, the algorithm ends here.

2.3 Algorithm Line-based

A block of lines with the following properties is searched here:

  • the last token of each line must be an integer greater than zero,
  • the numbers found there must be ascending, and
  • no number may be greater than the last page number of the document.
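
A minimal sketch of the line-based search, under stated assumptions: find_toc_block is a hypothetical name, lines are plain strings split on whitespace, "ascending" is read as non-decreasing (two entries may point to the same page), and min_entries stands in for lt_toc_min_entries.

```python
def find_toc_block(
    lines: list[str], last_page_no: int, min_entries: int = 3
) -> list[int]:
    """Return the indexes of the longest run of lines that looks like a TOC.

    An empty list is returned if no run reaches min_entries.
    """
    best: list[int] = []
    run: list[int] = []
    prev_page = 0
    for idx, line in enumerate(lines):
        tokens = line.split()
        last = tokens[-1] if tokens else ""
        # The last token must be a positive integer; 0 marks "no page number".
        page = int(last) if last.isascii() and last.isdigit() else 0
        valid = 0 < page <= last_page_no
        if valid and page >= prev_page:
            # The line continues the current run.
            run.append(idx)
            prev_page = page
            continue
        # The run is interrupted: remember it if it is the longest so far.
        if len(run) > len(best):
            best = run
        # A valid but non-ascending page number may start a new run.
        run, prev_page = ([idx], page) if valid else ([], 0)
    if len(run) > len(best):
        best = run
    return best if len(best) >= min_entries else []
```

For example, with the lines "Contents", "1 Introduction 1", "2 Methods 4" and "3 Results 9" in a 20-page document, the sketch returns the indexes 1, 2 and 3.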

3 Tables

PDFlib TET determines the tables contained in the PDF document and marks them accordingly in its XML output file. DCR-CORE uses these marks to determine the line type tab and, optionally, to output the tables in a separate JSON file.

3.1 Parameters

The following parameters control the classification of the tables:

  • create_extra_file_table

Default value: true - if true, a JSON file named <document_name>_table.json is created in the file directory data_accepted with the identified tables.

  • lt_table_file_incl_empty_columns

Default value: true - if true, the empty columns are included in the JSON file <document_name>_table.json.

  • spacy_ignore_line_type_table

Default value: false - determines whether the lines of this type are ignored (true) or not (false) during tokenization.

  • verbose_lt_table

Default value: false - the verbose mode is an option that provides additional details as to what the processing algorithm is doing.

4 Bulleted Lists

An element of a bulleted list extends either over a whole line or over a complete paragraph. All elements of a bulleted list must begin with one or more of the same characters and must not be interrupted by other lines or paragraphs.

4.1 Parameters

The following parameters control the classification of a bulleted list:

  • create_extra_file_list_bullet

Default value: true - if true, a JSON file named <document_name>_list_bullet.json is created in the file directory data_accepted with the identified bulleted lists.

  • lt_list_bullet_min_entries

Default value: 2 - the minimum number of entries in a bulleted list.

  • lt_list_bullet_rule_file

Default value: none - name of a file including file directory that contains the rules for determining the bulleted lists. none means that the given default rules are applied.

  • lt_list_bullet_tolerance_llx

Default value: 5 - percentage tolerance for the differences in indentation of an entry in a bulleted list.

  • spacy_ignore_line_type_list_bullet

Default value: false - determines whether the lines of this type are ignored (true) or not (false) during tokenization.

  • verbose_lt_list_bullet

Default value: false - the verbose mode is an option that provides additional details as to what the processing algorithm is doing.

4.2 Classification Identifiers

The following table shows the standard identifiers in the default processing order:

identifier   hexadecimal
"- "         2D20
". "         2E20
"\ufffd "    EFBFBD20
"o "         6F20
"° "         C2B020
"• "         E280A220
"‣ "         E280A320

However, these default rules can also be overridden via a JSON file (see parameter lt_list_bullet_rule_file). An example file can be found in the file directory data with the file name line_type_list_bullet_rules.json.

{
  "lineTypeListBulletRules": [
    "- ",
    ". ",
    "\ufffd ",
    "o ",
    "° ",
    "• ",
    "‣ "
  ]
}
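
How such a rule file might be applied can be sketched as follows. The helper names load_bullet_rules and find_bullet_lists are hypothetical, the sketch works on single lines only (paragraph-wide entries and the indentation tolerance lt_list_bullet_tolerance_llx are left out), and min_entries stands in for lt_list_bullet_min_entries.

```python
import json

# The same content as the example rule file, embedded here to keep the sketch
# self-contained.
RULES_JSON = """
{
  "lineTypeListBulletRules": ["- ", ". ", "\\ufffd ", "o ", "° ", "• ", "‣ "]
}
"""


def load_bullet_rules(text: str) -> list[str]:
    """Extract the list of bullet identifiers from the JSON text."""
    return json.loads(text)["lineTypeListBulletRules"]


def find_bullet_lists(
    lines: list[str], rules: list[str], min_entries: int = 2
) -> list[tuple[int, int, str]]:
    """Return (start, end, identifier) for each bulleted list; end is exclusive."""
    found = []
    start = 0
    while start < len(lines):
        # The first rule, in processing order, that matches the line start wins.
        bullet = next((r for r in rules if lines[start].startswith(r)), None)
        if bullet is None:
            start += 1
            continue
        # Extend the list while the following lines use the same identifier.
        end = start
        while end < len(lines) and lines[end].startswith(bullet):
            end += 1
        if end - start >= min_entries:
            found.append((start, end, bullet))
        start = end
    return found
```

A single line that happens to start with an identifier is not reported, because it does not reach the minimum number of entries.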

5 Numbered Lists

TBD

6 Headings

6.1 Parameters

The following parameters control the classification of the headings:

  • create_extra_file_heading

Default value: true - if true, a JSON file named <document_name>_heading.json is created in the file directory data_accepted with the identified headings.

  • lt_heading_file_incl_no_ctx

Default value: 1 - the n lines following the heading are included as context into the JSON file.

  • lt_heading_file_incl_regexp

Default value: false - if true, the regular expression for the heading is included in the JSON file.

  • lt_heading_max_level

Default value: 3 - the maximum number of hierarchical heading levels.

  • lt_heading_min_pages

Default value: 2 - the minimum number of document pages for determining headings.

  • lt_heading_rule_file

Default value: none - name of a file including file directory that contains the rules for determining the headings. none means that the given default rules are applied.

  • lt_heading_tolerance_llx

Default value: 5 - percentage tolerance for the differences in indentation of a heading at the same level.

  • spacy_ignore_line_type_heading

Default value: false - determines whether the lines of this type are ignored (true) or not (false) during tokenization.

  • verbose_lt_heading

Default value: false - the verbose mode is an option that provides additional details as to what the processing algorithm is doing.

6.2 Classification Rules

A heading classification rule contains the following 5 elements:

Nr.  element name    description
1    name            for documentation purposes, a name that characterises the rule
2    isFirstToken    if true, the rule is applied to the first token of the line, otherwise to the beginning of the line
3    regexp          the regular expression to be applied
4    functionIsAsc   a comparison function for the values of the predecessor and the successor
5    startValues     a list of allowed start values

The following comparison functions (functionIsAsc) can be used:

function            description
ignore              no comparison is performed
lowercase_letters   two lowercase letters are compared, whereby the ASCII difference must be exactly 1
romans              two Roman numerals are compared, whereby the difference must be exactly 1
strings             two strings are compared for ascending order
string_floats       two floating point numbers are compared, whereby the difference must be greater than 0 and less than 1
string_integers     two integer numbers are compared, whereby the difference must be exactly 1
uppercase_letters   two uppercase letters are compared, whereby the ASCII difference must be exactly 1
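
The comparison functions could be sketched as below. This is an illustration only: is_asc and roman_to_int are hypothetical names, and the sketch assumes that decoration such as parentheses and a trailing period is stripped before the values are compared, which the description above does not state.

```python
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(text: str) -> int:
    """Convert a Roman numeral (upper or lower case) to an integer."""
    values = [ROMAN_VALUES[ch] for ch in text.lower()]
    total = 0
    # A symbol followed by a larger one is subtracted (e.g. the I in IV).
    for cur, nxt in zip(values, values[1:] + [0]):
        total += -cur if cur < nxt else cur
    return total


def _is_next_letter(pred: str, succ: str) -> bool:
    """True if both are single characters whose code points differ by exactly 1."""
    return len(pred) == 1 and len(succ) == 1 and ord(succ) - ord(pred) == 1


def is_asc(function: str, predecessor: str, successor: str) -> bool:
    """Apply the named comparison to a predecessor/successor pair."""
    # Assumption: "(a)" is compared as "a", "1." as "1", "iv)" as "iv".
    pred, succ = predecessor.strip("()."), successor.strip("().")
    if function == "ignore":
        return True
    if function == "lowercase_letters":
        return pred.islower() and succ.islower() and _is_next_letter(pred, succ)
    if function == "uppercase_letters":
        return pred.isupper() and succ.isupper() and _is_next_letter(pred, succ)
    if function == "romans":
        return roman_to_int(succ) - roman_to_int(pred) == 1
    if function == "strings":
        return predecessor < successor
    if function == "string_floats":
        return 0 < float(succ) - float(pred) < 1
    if function == "string_integers":
        return int(succ) - int(pred) == 1
    raise ValueError(f"unknown comparison function: {function}")
```

With these definitions, "(a)" followed by "(b)" satisfies lowercase_letters, while "1.0" followed by "2.0" does not satisfy string_floats, because the difference is not less than 1.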

The following table shows the standard rules in the default processing order:

name isFirstToken regexp functionIsAsc startValues
(999) True "\(\d+\)$" string_integers ["(1)"]
(A) True "\([A-Z]\)$" uppercase_letters ["(A)"]
(ROM) True see a) romans ["(I)"]
(a) True "\([a-z]\)$" lowercase_letters ["(a)"]
(rom) True see b) romans ["(i)"]
999) True "\d+\)$" string_integers ["1)"]
999. True "\d+\.$" string_integers ["1."]
999.999 True "\d+\.\d\d\d$" string_floats ["1.000", "1.001"]
999.99 True "\d+\.\d\d$" string_floats ["1.00", "1.01"]
999.9 True "\d+\.\d$" string_floats ["1.0", "1.1"]
A) True "[A-Z]\)$" uppercase_letters ["A)"]
A. True "[A-Z]\.$" uppercase_letters ["A, "A."]
ROM) True see c) romans ["I)"]
ROM. True see d) romans ["I."]
a) True "[a-z]\)$" lowercase_letters ["a)"]
a. True "[a-z]\.$" lowercase_letters ["a, "a."]
rom) True see e) romans ["i)"]
rom. True see f) romans ["i."]

a) "\(M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\)$"

b) "\(m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})\)$"

c) "M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\)$"

d) "M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\.$"

e) "m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})\)$"

f) "m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})\.$"

However, these default rules can also be overridden via a JSON file (see parameter lt_heading_rule_file). An example file can be found in the file directory data with the file name heading_rules_test.json.

{
  "lineTypeHeadingRules": [
    {
      "name": "(a)",
      "isFirstToken": true,
      "regexp": "\\([a-z]\\)$",
      "functionIsAsc": "lowercase_letters",
      "startValues": [
        "(a)"
      ]
    },
    {
      "name": "(A)",
      "isFirstToken": true,
      "regexp": "\\([A-Z]\\)$",
      "functionIsAsc": "uppercase_letters",
      "startValues": [
        "(A)"
      ]
    },

6.3 Algorithm

  • the document is worked through page by page and within a page line by line
  • for each current heading level there is an entry in a hierarchy table
  • for each document line, this hierarchy table is searched from bottom to top for a matching entry

  • an entry is considered to be matching if

    • the regular expression is satisfied, and
    • the indentation is within the specified tolerance (lt_heading_tolerance_llx), and
    • the comparison function is fulfilled
  • if there is a match, the following processing steps are carried out and then the next document line is processed

    • an entry for the JSON file is optionally created
    • any existing lower entries in the hierarchy table are deleted
  • if no match is found, then the given heading rules are searched in the specified order

  • a heading rule is matching if

    • the regular expression is satisfied, and
    • one of the optional start values matches the document line, and
    • the new heading level is within the specified limit (lt_heading_max_level)
  • if there is a match, the following processing steps are carried out and then the next document line is processed

    • the last heading level is increased by 1,
    • a new entry is added to the hierarchy table
    • an entry for the JSON file is optionally created
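
The interplay of hierarchy table and heading rules can be sketched as follows. This is a simplified illustration under several assumptions, not the DCR-CORE implementation: the names Rule, Entry, is_successor and classify_headings are hypothetical, each line is reduced to its first token and its left x-coordinate (llx), the comparison function is replaced by a stand-in that only handles integers and single letters, the tolerance is taken as a percentage of the stored llx value, and the optional JSON output is omitted.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    regexp: str
    start_values: list[str]


@dataclass
class Entry:
    rule: Rule
    last_value: str
    llx: float


def is_successor(pred: str, succ: str) -> bool:
    """Stand-in for functionIsAsc: integers or single letters ascending by 1."""
    p, s = pred.strip("()."), succ.strip("().")
    if p.isdigit() and s.isdigit():
        return int(s) - int(p) == 1
    return len(p) == 1 and len(s) == 1 and ord(s) - ord(p) == 1


def classify_headings(
    lines: list[tuple[str, float]],
    rules: list[Rule],
    max_level: int = 3,
    tolerance_llx: float = 5.0,
) -> list[int]:
    """Return the heading level of each line (0 = not a heading)."""
    hierarchy: list[Entry] = []
    levels = []
    for token, llx in lines:
        level = 0
        # 1) Search the hierarchy table from the lowest level upwards.
        for idx in range(len(hierarchy) - 1, -1, -1):
            entry = hierarchy[idx]
            within = abs(llx - entry.llx) <= entry.llx * tolerance_llx / 100
            if (
                re.match(entry.rule.regexp, token)
                and within
                and is_successor(entry.last_value, token)
            ):
                entry.last_value = token
                del hierarchy[idx + 1:]  # delete the lower entries
                level = idx + 1
                break
        # 2) Otherwise try the heading rules in the given order.
        if level == 0 and len(hierarchy) < max_level:
            for rule in rules:
                starts = not rule.start_values or token in rule.start_values
                if re.match(rule.regexp, token) and starts:
                    hierarchy.append(Entry(rule, token, llx))
                    level = len(hierarchy)
                    break
        levels.append(level)
    return levels
```

In this sketch a line such as "3." with a clearly different indentation is not accepted as the successor of "2.", and it cannot open a new level either, because "3." is not among the start values.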