infoxtractor

Dirk Riemann 1321d57354	2026-04-18 10:53:46 +02:00

feat(segmentation): SegmentIndex + prompt-text formatter (spec §9.1)

Builds the ID <-> on-page-anchor map used by both the GenAIStep (to emit the segment-tagged user message) and the provenance mapper (to resolve LLM-cited IDs back to bbox/text/file_index).

Design notes:

- `build()` is a classmethod so the pipeline constructs the index in one place (OCRStep) and passes the constructed instance along in the internal context. No mutable global state; tests build indexes inline from fake OCR fixtures.
- Per-page metadata (file_index) arrives via a parallel `list[PageMetadata]` rather than being smuggled into OCRResult. Keeps segmentation decoupled from ingestion: the OCR engine legitimately doesn't know which file a page came from.
- Page-tag lines (`<page …>` / `</page>`) are filtered via a regex so the LLM can never cite them as provenance. `line_idx_in_page` increments only for real lines so the IDs stay dense (p1_l0, p1_l1, ...).
- Bounding-box normalisation divides x-coords by page width, y-coords by page height. Zero dimensions (defensive) pass through unchanged.
- `to_prompt_text(context_texts=[...])` appends paperless-style texts untagged, separated from the tagged body by a blank line (spec §7.2b). Deterministic for prompt caching.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
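
The design notes in this commit message can be illustrated with a rough sketch. This is not the code from the commit. Only `SegmentIndex`, `build()`, `to_prompt_text(context_texts=...)`, `PageMetadata`, `file_index`, and `line_idx_in_page` are named in the message. The `Line`, `Page`, and `Segment` types, the `PAGE_TAG_RE` pattern, the `lookup()` method, the `normalise_bbox()` helper, and the `[id] text` prompt layout are assumptions made for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed pattern for page-tag lines; the real regex is not shown in the commit.
PAGE_TAG_RE = re.compile(r"^\s*</?page\b[^>]*>\s*$")

BBox = tuple[float, float, float, float]  # x0, y0, x1, y1


@dataclass(frozen=True)
class Line:  # hypothetical stand-in for one OCR line
    text: str
    bbox: BBox  # absolute page coordinates


@dataclass(frozen=True)
class Page:  # hypothetical stand-in for one OCR page
    page_no: int
    width: float
    height: float
    lines: list[Line]


@dataclass(frozen=True)
class PageMetadata:
    file_index: int


@dataclass(frozen=True)
class Segment:  # hypothetical record stored per segment ID
    text: str
    file_index: int
    bbox: BBox  # normalised to 0..1 when page dimensions are non-zero


def normalise_bbox(bbox: BBox, width: float, height: float) -> BBox:
    """Divide x-coords by page width and y-coords by page height."""
    if not width or not height:
        return bbox  # defensive: zero dimensions pass through unchanged
    x0, y0, x1, y1 = bbox
    return (x0 / width, y0 / height, x1 / width, y1 / height)


class SegmentIndex:
    def __init__(self, segments: dict[str, Segment]) -> None:
        self._segments = segments  # insertion order gives deterministic output

    @classmethod
    def build(cls, pages: list[Page], metadata: list[PageMetadata]) -> SegmentIndex:
        if len(pages) != len(metadata):
            raise ValueError("pages and metadata must be parallel lists")
        segments: dict[str, Segment] = {}
        for page, meta in zip(pages, metadata):
            line_idx_in_page = 0
            for line in page.lines:
                if PAGE_TAG_RE.match(line.text):
                    continue  # page tags never get an ID, so they cannot be cited
                seg_id = f"p{page.page_no}_l{line_idx_in_page}"
                segments[seg_id] = Segment(
                    text=line.text,
                    file_index=meta.file_index,
                    bbox=normalise_bbox(line.bbox, page.width, page.height),
                )
                line_idx_in_page += 1  # only real lines advance, so IDs stay dense
        return cls(segments)

    def lookup(self, seg_id: str) -> Segment | None:
        """Resolve an LLM-cited ID back to its text, bbox, and file_index."""
        return self._segments.get(seg_id)

    def to_prompt_text(self, context_texts: list[str] | None = None) -> str:
        # Assumed layout: one "[id] text" line per segment.
        body = "\n".join(f"[{sid}] {seg.text}" for sid, seg in self._segments.items())
        extras = list(context_texts or [])
        if not extras:
            return body
        # Untagged context texts follow the tagged body after a blank line.
        return body + "\n\n" + "\n\n".join(extras)
```

In this sketch a provenance mapper would call `lookup("p1_l0")` on the instance returned by `build()`. Because page-tag lines are skipped before an ID is assigned, no ID exists that could resolve to a tag.
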
integration	fix(ci): create empty tests/integration so pytest doesn't error on missing dir	2026-04-18 10:39:26 +02:00
unit	feat(segmentation): SegmentIndex + prompt-text formatter (spec §9.1)	2026-04-18 10:53:46 +02:00
__init__.py	feat(scaffold): project skeleton with uv + pytest + forgejo CI	2026-04-18 10:36:43 +02:00
conftest.py	feat(scaffold): project skeleton with uv + pytest + forgejo CI	2026-04-18 10:36:43 +02:00