Document Classifier

Seven-stage classification pipeline with strategies for ERP and DATEV

The Document Classifier is ELONIQ's central classification engine. It analyses incoming documents, detects their type, extracts metadata and — via pluggable strategies — proposes concrete posting entries. Seven pipeline stages (OCR, Regex, LLM, SmartDB, Learning, Heuristic, E-Invoice) work in cascade and can be enabled or skipped individually.

Overview

The Document Classifier processes incoming documents through a multi-stage pipeline. Each stage can be configured, enabled or disabled individually, and produces results with a confidence score that subsequent stages reuse.

Pipeline stages

OCR — text recognition. First stage, always mandatory. Tesseract (local) or Azure (cloud).
Regex — deterministic field extraction via PCRE patterns. Fast and cheap, ideal for structured fields.
LLM — AI-based classification and extraction. Expensive but flexible for free text.
SmartDB — master-data lookup. Validates extracted fields against the ERP DB and enriches them (tenant, vendor, customer).
Learning — collects manual corrections and proposes them as learning rules.
Heuristic — rule-based fallbacks for fields without a clear pattern.
E-Invoice — special handling for ZUGFeRD/XRechnung documents (structured XML overrides OCR results).
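
The cascade can be pictured as a loop over the enabled stages, all of which write into one shared result. The following is a minimal sketch in Python, not ELONIQ's implementation: the names FieldValue, Result, Stage and run_pipeline, the toy stage functions and the "highest confidence wins" merge rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FieldValue:
    value: str
    confidence: float  # 0.0 .. 1.0
    source: str        # name of the stage that produced the value


@dataclass
class Result:
    text: str = ""
    fields: Dict[str, FieldValue] = field(default_factory=dict)
    trace: List[str] = field(default_factory=list)

    def propose(self, name: str, candidate: FieldValue) -> None:
        # Assumed merge rule: keep a candidate only if it beats earlier stages.
        current = self.fields.get(name)
        if current is None or candidate.confidence > current.confidence:
            self.fields[name] = candidate


@dataclass
class Stage:
    name: str
    run: Callable[[Result], None]
    enabled: bool = True


def run_pipeline(stages: List[Stage], result: Result) -> Result:
    for stage in stages:
        if not stage.enabled:
            result.trace.append(f"{stage.name}: skipped")
            continue
        stage.run(result)
        result.trace.append(f"{stage.name}: done")
    return result


# Toy stages standing in for the real ones.
def ocr(result: Result) -> None:
    result.text = "Invoice No. RE-2024-0042"


def regex(result: Result) -> None:
    if "RE-2024-0042" in result.text:
        result.propose("InvoiceNo", FieldValue("RE-2024-0042", 0.95, "Regex"))


def heuristic(result: Result) -> None:
    # Lower confidence, so it does not replace the Regex hit.
    result.propose("InvoiceNo", FieldValue("2024-0042", 0.40, "Heuristic"))


stages = [
    Stage("OCR", ocr),
    Stage("Regex", regex),
    Stage("LLM", lambda r: None, enabled=False),
    Stage("Heuristic", heuristic),
]
outcome = run_pipeline(stages, Result())
```

In this sketch the disabled LLM stage only leaves a "skipped" entry in the trace, and the Heuristic fallback cannot overwrite the more confident Regex value.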

Strategy system

The SmartDB stage delegates the actual lookup logic to a strategy. Three strategies are available:

default — simple table lookup without posting logic. For custom SQL joins via the SmartDB tables.
erp — generic ERP strategy. Three logical tables (Client/Vendor/Customer), single- or multi-tenant. Works with any ERP whose schema can be mapped onto it.
datev — DATEV integration with tenant logic. Requires the DATEV integration to be active.

Strategies can be equipped with a dedicated renderer for the Test page via the StrategyWithUI interface (see the DATEV renderer as a reference).
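
To illustrate the idea of pluggable strategies, here is a hedged Python sketch. Only the strategy names (default, erp) and the interface name StrategyWithUI come from the text above; the methods lookup and render_result, the registry and the sample vendor are hypothetical and do not describe the real interface.

```python
from typing import Dict, Protocol, runtime_checkable


class Strategy(Protocol):
    """Hypothetical lookup contract; the real interface is not documented here."""

    name: str

    def lookup(self, fields: Dict[str, str]) -> Dict[str, str]: ...


@runtime_checkable
class StrategyWithUI(Protocol):
    """Only the interface name comes from the text; the method is assumed."""

    def render_result(self, result: Dict[str, str]) -> str: ...


class DefaultStrategy:
    name = "default"

    def lookup(self, fields: Dict[str, str]) -> Dict[str, str]:
        # Plain lookup: returns the fields unchanged in this sketch.
        return dict(fields)


class ErpStrategy:
    name = "erp"

    def __init__(self, vendors_by_vat_id: Dict[str, str]) -> None:
        self._vendors = vendors_by_vat_id

    def lookup(self, fields: Dict[str, str]) -> Dict[str, str]:
        enriched = dict(fields)
        vendor = self._vendors.get(fields.get("VAT_ID", ""))
        if vendor is not None:
            enriched["KREDITOR_NAME"] = vendor
        return enriched

    def render_result(self, result: Dict[str, str]) -> str:
        # Stand-in for a dedicated Test page renderer.
        return "\n".join(f"{k}: {v}" for k, v in sorted(result.items()))


STRATEGIES = {
    s.name: s
    for s in (DefaultStrategy(), ErpStrategy({"DE123456789": "Acme Supplies GmbH"}))
}


def select_strategy(name: str) -> Strategy:
    try:
        return STRATEGIES[name]
    except KeyError:
        raise ValueError(f"unknown strategy: {name!r}") from None
```

A caller could check isinstance(strategy, StrategyWithUI) to decide whether a dedicated renderer is available and fall back to a generic view otherwise.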

Triggers

Classification can be started in two ways: via watchfolders (file-system scan) or via workflow nodes in ELO. Both trigger types can run in parallel.

Features

Seven-stage pipeline — OCR · Regex · LLM · SmartDB · Learning · Heuristic · E-Invoice. Each stage has its own configuration page and stage colour code.
Multi-engine OCR — Tesseract (local, free), Azure Document Intelligence, Azure F1.
Regex stage — deterministic field extraction via PCRE patterns, much faster and cheaper than LLM.
LLM classification — OpenAI, Azure OpenAI, Ollama (local), Anthropic. Prompt-based classification and field extraction.
SmartDB lookup — connect arbitrary databases, match master data (exact, fuzzy, contains).
Strategy system — pluggable classification strategies: default (standard lookup), erp (built-in generic ERP strategy for Client/Vendor/Customer), datev (DATEV integration with tenant logic).
Single- and multi-tenant — the ERP strategy supports both SMBs (one tenant) and tax offices (n tenants).
Learning system — collects manual corrections and reapplies them automatically on similar documents.
Watchfolder + workflow triggers — classification via file-system scan or ELO workflow node.
Test page with modal result — upload any file, run all stages live, show the strategy result in a dedicated Bootstrap modal.
Privacy mode — fully local pipeline without cloud calls for sensitive documents.
Diagnostics page — the N most recent classifications with stage trace, confidences, OCR snippet and LLM response.
Per-stage confidence thresholds — globally and per stage, fine-tunable.
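
The three SmartDB match modes named in the list above (exact, fuzzy, contains) can be illustrated with a small Python sketch. The function name, the normalisation, the direction of the contains check and the 0.85 similarity cut-off are assumptions; the product's actual matching rules may differ.

```python
from difflib import SequenceMatcher


def matches(candidate: str, master: str, mode: str, min_ratio: float = 0.85) -> bool:
    """Compare an extracted value with a master-data value."""
    a = candidate.strip().lower()
    b = master.strip().lower()
    if mode == "exact":
        return a == b
    if mode == "contains":
        # Assumed direction: the master value occurs inside the extracted text.
        return bool(b) and b in a
    if mode == "fuzzy":
        return SequenceMatcher(None, a, b).ratio() >= min_ratio
    raise ValueError(f"unknown match mode: {mode!r}")
```

Fuzzy matching tolerates OCR noise such as a stray full stop, at the price of possible false matches between similar company names, so the cut-off is worth validating against real master data.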

Usage

1. Plan the pipeline order

Decide which stages are relevant for your documents. Structured invoices typically need only OCR + Regex + SmartDB. Free text (contracts, emails) needs LLM. ZUGFeRD invoices benefit from the E-Invoice stage.

2. Know your masks and entities

Under Data → Masks, review the ELO masks classification will target. Under Data → Entities, define logical groupings when multiple masks represent the same document type.

3. Configure OCR

Use Tesseract for local, low-cost processing and Azure for a higher recognition rate. Use the Test page to run individual documents and check the OCR result.

4. Write regex patterns

Store PCRE patterns for the deterministic fields (invoice number, date, amount). Validate via the Test page and adjust the confidence.
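
As a starting point, patterns for the three fields might look like the following sketch. It uses Python's re module, which is not PCRE, but the syntax used here is common to both. The sample text, the invoice number format and the field names are invented for illustration and need to be adapted to your documents.

```python
import re
from decimal import Decimal

SAMPLE = "Invoice No: RE-2024-0042\nDate: 14.03.2024\nTotal amount: 1.190,00 EUR"

PATTERNS = {
    # Two letters, four digits, four digits, e.g. RE-2024-0042 (assumed format).
    "InvoiceNo": re.compile(r"Invoice\s+No\.?\s*:?\s*([A-Z]{2}-\d{4}-\d{4})"),
    # German date notation DD.MM.YYYY.
    "Date": re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b"),
    # German amount notation: dot as thousands separator, comma for decimals.
    "Amount": re.compile(r"Total\s+amount\s*:?\s*(\d{1,3}(?:\.\d{3})*,\d{2})"),
}


def extract(text: str) -> dict:
    """Return the first capture group of every pattern that matches."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found


def parse_amount(raw: str) -> Decimal:
    """Convert '1.190,00' into Decimal('1190.00')."""
    return Decimal(raw.replace(".", "").replace(",", "."))


fields = extract(SAMPLE)
```

Anchoring each pattern on a label such as "Invoice No" or "Total amount" keeps it from matching unrelated numbers elsewhere on the page.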

5. Connect an LLM provider (optional)

Enable the LLM stage only when Regex and SmartDB are not enough. Activate stoponfirstsuccessful so that LLM is skipped on confident Regex hits, which saves cost.

6. Choose a SmartDB strategy

For ERP master-data integration: enable the ERP strategy, map Client/Vendor/Customer tables, validate via 'Test connection'. For DATEV offices, use the DATEV integration instead.

7. Set up triggers

Watchfolders for file-based sources (scanner, mail importer), workflow triggers for ELO-internal inbox processes.

8. Run the Test page

Before going live, run a fixed sample set (5–10 typical documents) through the Test page. Check the strategy result modal and fine-tune confidence thresholds.

9. Diagnostics in operation

Check the Diagnostics page daily during the first month, then weekly. With the learning system on, clerk corrections automatically flow into learning profiles.

Best Practices

Think in stage order

OCR is always mandatory: no other stage can work without text. After that, this ordering has proven robust: Regex first for deterministic fields (invoice number, date, amounts), LLM only when Regex falls short (free-text descriptions, soft classification), then SmartDB as validation against master data. LLM is the most expensive stage, so enable Stop on first successful (stoponfirstsuccessful): as soon as Regex reaches a high confidence, LLM is skipped.
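
The skip decision can be sketched as a small predicate. This is an assumption about how such a rule could work, not the product's documented logic: the function name, the notion of "required fields" and the 0.9 cut-off are hypothetical.

```python
from typing import Dict, Iterable


def should_run_llm(
    regex_confidences: Dict[str, float],
    required_fields: Iterable[str],
    stop_on_first_successful: bool,
    high_confidence: float = 0.9,
) -> bool:
    """Decide whether the LLM stage still has work to do after Regex."""
    if not stop_on_first_successful:
        return True
    # Run the LLM only if at least one required field is missing or weak.
    return any(
        regex_confidences.get(name, 0.0) < high_confidence
        for name in required_fields
    )
```

With this rule a structured invoice where Regex found every required field never reaches the LLM, while a contract without regex hits always does.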

Calibrate confidence thresholds

Two global thresholds under Basis drive the pipeline:

maskconfidencethreshold (default 0.65): below this value the detected mask is flagged as uncertain. Stay conservative — a misidentified mask produces wrong fields.
fieldconfidencethreshold (default 0.5): below this value an extracted field value is discarded. Raise it when you see many false positives, lower it when too many correct values are lost.

Adjust in 0.05 increments and validate via the Test page against a fixed sample set.
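
The effect of the two thresholds can be sketched as follows. The defaults 0.65 and 0.5 come from the text above; the class and function names and the sample fields are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Thresholds:
    mask: float = 0.65   # maskconfidencethreshold default
    field: float = 0.5   # fieldconfidencethreshold default


def apply_thresholds(
    mask_confidence: float,
    fields: Dict[str, Tuple[str, float]],
    thresholds: Thresholds = Thresholds(),
) -> Tuple[bool, Dict[str, str]]:
    """Return (mask_is_uncertain, fields that survive the field threshold)."""
    mask_uncertain = mask_confidence < thresholds.mask
    kept = {
        name: value
        for name, (value, confidence) in fields.items()
        if confidence >= thresholds.field
    }
    return mask_uncertain, kept


sample = {"InvoiceNo": ("RE-2024-0042", 0.95), "Amount": ("1.190,00", 0.45)}

# Sweep the field threshold in 0.05 steps to see which fields survive.
sweep = {
    t: sorted(apply_thresholds(0.92, sample, Thresholds(field=t))[1])
    for t in (0.40, 0.45, 0.50)
}
```

Running such a sweep over a fixed sample set shows at which step a field starts to be discarded, which makes each 0.05 adjustment easier to justify.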

Privacy mode for sensitive documents

Enable privacysensitive for tenants with particularly sensitive data (HR, health, NDAs). Cloud OCR and cloud LLM are then fully skipped — only local providers run. Performance suffers, compliance wins.
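
Conceptually, privacy mode is a filter over the configured providers. The sketch below uses the provider names from the feature list; the Provider class and the allowed_providers function are hypothetical and only illustrate the rule "cloud providers are skipped".

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Provider:
    name: str
    kind: str    # "ocr" or "llm"
    cloud: bool


PROVIDERS = [
    Provider("Tesseract", "ocr", cloud=False),
    Provider("Azure Document Intelligence", "ocr", cloud=True),
    Provider("Ollama", "llm", cloud=False),
    Provider("OpenAI", "llm", cloud=True),
    Provider("Azure OpenAI", "llm", cloud=True),
    Provider("Anthropic", "llm", cloud=True),
]


def allowed_providers(providers: Iterable[Provider], privacy_sensitive: bool) -> List[Provider]:
    """With privacy mode on, keep local providers only."""
    if not privacy_sensitive:
        return list(providers)
    return [p for p in providers if not p.cloud]


local_only = [p.name for p in allowed_providers(PROVIDERS, privacy_sensitive=True)]
```

If no local LLM is configured, the filtered list contains no LLM provider at all, so the pipeline for such tenants has to work with OCR, Regex and SmartDB alone.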

Watchfolder vs. workflow trigger

Watchfolders are right for 'push'-style sources (mail importers, scanners, FTP drops). Workflow triggers are right for ELO-internal processes (inbox workflow, approval). Both can run in parallel without conflict — the Job-ID logger prevents double processing.
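
How a job-ID check can prevent double processing is sketched below. The text does not say how the Job-ID logger derives or stores its IDs, so deriving the ID from a content hash and keeping the log in memory are assumptions made purely for illustration.

```python
import hashlib
from typing import Dict


def job_id_for(content: bytes) -> str:
    # Assumption: identical file content yields the same job ID.
    return hashlib.sha256(content).hexdigest()


class JobLog:
    """In-memory stand-in for a job-ID log (not thread-safe)."""

    def __init__(self) -> None:
        self._claimed_by: Dict[str, str] = {}

    def try_claim(self, job_id: str, trigger: str) -> bool:
        """Return True if this trigger is the first to claim the job."""
        if job_id in self._claimed_by:
            return False
        self._claimed_by[job_id] = trigger
        return True


log = JobLog()
jid = job_id_for(b"%PDF-1.7 sample invoice")
first = log.try_claim(jid, "watchfolder")
second = log.try_claim(jid, "workflow")
```

A production version would need a shared, persistent store with an atomic insert so that two triggers running in parallel cannot both claim the same document.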

Enable the learning system from day one

The learning system only pays off after some weeks because it collects manual corrections made by clerks. The earlier it runs, the better the proposals — even when the first weeks show little visible effect.
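
Why the effect only shows after some weeks becomes clearer with a sketch: a correction turns into a proposal only once it has been seen often enough. The class below is a hypothetical illustration; keying corrections by vendor and field and the minimum of three occurrences are assumptions, not the product's learning logic.

```python
from collections import Counter
from typing import Optional


class CorrectionLog:
    """Counts manual corrections and proposes the recurring ones."""

    def __init__(self, min_occurrences: int = 3) -> None:
        self._min = min_occurrences
        self._counts: Counter = Counter()

    def record(self, vendor: str, field: str, corrected_value: str) -> None:
        self._counts[(vendor, field, corrected_value)] += 1

    def propose(self, vendor: str, field: str) -> Optional[str]:
        """Return the most frequent correction once it is seen often enough."""
        candidates = [
            (count, value)
            for (v, f, value), count in self._counts.items()
            if v == vendor and f == field and count >= self._min
        ]
        if not candidates:
            return None
        return max(candidates)[1]
```

Until a vendor has accumulated enough corrections, propose returns None, which matches the observation that the first weeks show little visible effect.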

SmartDB lookups need an ERP or DATEV strategy

The default SmartDB strategy searches master data without posting logic. For actual posting-entry proposals you need one of the other two strategies: either the built-in ERP strategy (generic, three tables Client/Vendor/Customer) or the DATEV integration for DATEV tax offices. With the default strategy alone, SmartDB only returns isolated fields without tenant context.

Use Diagnostics, don't ignore them

The Diagnostics page lists recent classifications with stage trace and per-field confidences. Even when 'everything runs', check it regularly — silent drift in confidence values (e.g. after an LLM model change) becomes visible here immediately.
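
A simple way to make drift measurable is to compare the mean confidence of a baseline period with that of a recent period. The sketch below assumes you have copied the confidences out of the Diagnostics page by hand; the function name, the sample values and the 0.05 tolerance are illustrative assumptions.

```python
from statistics import mean
from typing import Sequence, Tuple


def confidence_drift(
    baseline: Sequence[float],
    recent: Sequence[float],
    tolerance: float = 0.05,
) -> Tuple[float, bool]:
    """Return (change in mean confidence, whether it exceeds the tolerance)."""
    delta = mean(recent) - mean(baseline)
    return delta, abs(delta) > tolerance


# Invented sample values, e.g. mask confidences before and after a model change.
before = [0.90, 0.92, 0.88]
after = [0.80, 0.78, 0.82]
delta, drifted = confidence_drift(before, after)
```

A mean hides outliers, so it is worth looking at the lowest values in the recent period as well when the flag is raised.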

Examples

Incoming invoice — Regex + ERP strategy

An incoming invoice is imported via watchfolder.

OCR reads the PDF text.
Regex extracts InvoiceNo, Date, Amount and VAT-ID with high confidence.
Mask IncomingInvoice is detected reliably (confidence 0.92).
SmartDB with ERP strategy looks up the vendor in the ERP DB via VAT-ID and IBAN.
Tenant and vendor are written into the result (KREDITOR_GUID, KREDITOR_NAME, MANDANT_NR …).
Workflow trigger starts the ELO inbox workflow with all fields pre-filled.

Contract — LLM classification

A PDF contract from the inbox.

OCR provides the full text.
Regex finds no matching pattern (contract formats are individual).
LLM classifies as Contract and extracts parties, term, notice period.
SmartDB validates the counterparty against master data.
The learning system records the clerk's manual confirmation — on the next similar contract the proposal is accepted automatically.

Multi-tenant DATEV office

A DATEV tax office processes documents for 50+ tenants in a shared DB.

Document arrives via mail importer.
DATEV strategy identifies the tenant from document fields (recipient address, VAT-ID).
Then the vendor is searched under that tenant (tenant-scoped).
Posting account and tax key are taken from master data.
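
The two-step lookup (tenant first, then vendor under that tenant) can be sketched with an in-memory SQLite database. Table names, column names and all sample rows are invented for illustration; they do not reflect the DATEV schema or the strategy's real queries.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE tenant (mandant_nr INTEGER PRIMARY KEY, vat_id TEXT);
    CREATE TABLE vendor (
        mandant_nr INTEGER, vat_id TEXT, kreditor_name TEXT,
        posting_account TEXT, tax_key TEXT
    );
    INSERT INTO tenant VALUES (1000, 'DE111111111'), (2000, 'DE222222222');
    INSERT INTO vendor VALUES
        (1000, 'DE999999999', 'Paper Supplies AG', '3400', '9'),
        (2000, 'DE999999999', 'Paper Supplies AG', '3200', '8');
    """
)


def find_tenant(db: sqlite3.Connection, recipient_vat_id: str) -> Optional[int]:
    """Step 1: identify the tenant from the recipient's VAT-ID."""
    row = db.execute(
        "SELECT mandant_nr FROM tenant WHERE vat_id = ?", (recipient_vat_id,)
    ).fetchone()
    return None if row is None else row["mandant_nr"]


def find_vendor(db: sqlite3.Connection, mandant_nr: int, vendor_vat_id: str) -> Optional[dict]:
    """Step 2: search the vendor under that tenant only (tenant-scoped)."""
    row = db.execute(
        "SELECT kreditor_name, posting_account, tax_key FROM vendor "
        "WHERE mandant_nr = ? AND vat_id = ?",
        (mandant_nr, vendor_vat_id),
    ).fetchone()
    return None if row is None else dict(row)


tenant = find_tenant(conn, "DE222222222")
vendor = find_vendor(conn, tenant, "DE999999999")
```

The same vendor exists under both tenants with different posting accounts, which is why the vendor lookup has to be scoped to the tenant found in step 1.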

Single-tenant SMB

A small-business setup has exactly one tenant.

ERP strategy configured in single-tenant mode.
The tenant is static (MANDANT_NR 1000, MANDANT_NAME 'Acme GmbH').
For each document the static tenant is set; vendor/customer lookups ignore the tenant column.
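
The difference between the two modes can be sketched as a small configuration object. MANDANT_NR, MANDANT_NAME and the sample values come from the example above; the ErpConfig class, its attributes and the helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ErpConfig:
    multi_tenant: bool
    static_mandant_nr: Optional[int] = None
    static_mandant_name: Optional[str] = None


def tenant_fields(config: ErpConfig) -> Dict[str, object]:
    """In single-tenant mode every document gets the static tenant."""
    if config.multi_tenant:
        raise ValueError("multi-tenant mode resolves the tenant per document")
    return {
        "MANDANT_NR": config.static_mandant_nr,
        "MANDANT_NAME": config.static_mandant_name,
    }


def vendor_criteria(config: ErpConfig, vat_id: str, mandant_nr: Optional[int] = None) -> Dict[str, object]:
    """Build lookup criteria; the tenant column is used in multi-tenant mode only."""
    criteria: Dict[str, object] = {"vat_id": vat_id}
    if config.multi_tenant:
        if mandant_nr is None:
            raise ValueError("multi-tenant lookups need a tenant number")
        criteria["mandant_nr"] = mandant_nr
    return criteria


smb = ErpConfig(multi_tenant=False, static_mandant_nr=1000, static_mandant_name="Acme GmbH")
```

Keeping the mode in one configuration value means the same lookup code can serve both the SMB and the tax-office setup.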