DCR - Developing - Version Planning
1. Version Planning
1.1 Open
Version | Feature(s) |
---|---|
0.9.8 | TBD |
1.2 Already implemented
Version | Feature(s) |
---|---|
0.9.7 | Documentation and test improvements |
0.9.6 | Extracting an API |
0.9.3 | Extending NLP capabilities |
0.9.2 | Refactoring database and code |
0.9.1 | Core text preprocessing and wrangling |
0.9.0 | Parser |
0.8.0 | PDFlib TET processing |
0.7.0 | Tesseract OCR processing |
0.6.5 | Pandoc processing |
0.6.0 | pdf for Tesseract OCR processing |
0.5.0 | Inbox processing |
2. Next Development Steps
2.1 Open
2.1.1 High Priority
- pandoc_dcr: convert
doc
documents todocx
- user: reconstruct original document
2.1.2 Normal Priority
- API documentation: Content improvement
- API documentation: Layout improvement
- admin: reset a list of documents: clean up the database before the next process retry - delete existing data
- tool: check the content of the file directory against the database
- line type header & footer:
- optional: ignore the first / last page
- optional: logging of the applied method
- optional: page number alternating in the first and last token
- optional: page number always in the first / last token
- optional: use of Levenshtein algorithm
- optional: use of language-related regular expressions to determine the header / footer with the page number
- line type table:
- check the coordinates
- table with page break
2.1.3 Low Priority
- Google Styleguide implementation
2.2 Already implemented
- API Documentation
- PDFlib TET processing
- Tesseract OCR - Installation
- all: database table 'document': new column file_size
- all: database table 'document': new column no_pages_pdf
- all: merge database table 'journal' into 'document'
- clean up the auxiliary files in file directory inbox_accepted - keep the base document
- combine
pdf
files - scannedpdf
documents - after Tesseract OCR - convert the appropriate documents into the
pdf
format with Pandoc and TeX Live - duplicate handling
- error correction version 0.9.0
- error handling - highly defensive
- inbox.py - process_inbox() - processing ocr & non-ocr in the same method
- introduce default language
- introduce document language - eventually inbox sub folder per language
- load initialisation data
- optionally save the original document in the database
- parser result with JSON
- parser: classify the lines, e.g. body, footer, header etc.
- replace TeX Live by LuaLaTeX or XeLuTeX (Unicode)
- test cases for file duplicate