infoxtractor/tests
Dirk Riemann 1321d57354
feat(segmentation): SegmentIndex + prompt-text formatter (spec §9.1)
Builds the ID <-> on-page-anchor map used by both the GenAIStep (to emit the
segment-tagged user message) and the provenance mapper (to resolve LLM-cited
IDs back to bbox/text/file_index).

Design notes:

- `build()` is a classmethod so the pipeline constructs the index in one
  place (OCRStep) and passes the constructed instance along in the internal
  context. No mutable global state; tests build indexes inline from fake
  OCR fixtures.

- Per-page metadata (file_index) arrives via a parallel `list[PageMetadata]`
  rather than being smuggled into OCRResult. This keeps segmentation
  decoupled from ingestion: the OCR engine legitimately doesn't know which
  file a page came from.

- Page-tag lines (`<page …>` / `</page>`) are filtered via a regex so the
  LLM can never cite them as provenance. `line_idx_in_page` increments only
  for real lines so the IDs stay dense (p1_l0, p1_l1, ...).

- Bounding-box normalisation divides x-coords by page width and y-coords by
  page height. As a defensive measure, a zero page dimension leaves the
  coordinates unchanged instead of dividing by zero.

- `to_prompt_text(context_texts=[...])` appends paperless-style texts
  untagged, separated from the tagged body by a blank line (spec §7.2b).
  The output is deterministic, so identical inputs yield identical prompts
  and prompt caching stays effective.
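
The design notes above can be sketched roughly as follows. This is an
illustrative sketch, not the code in this commit: `PAGE_TAG_RE`, `Segment`,
`build_segments`, the dict-shaped page fixtures and the `[p1_l0] text` tag
format are all hypothetical stand-ins, and skipping normalisation when
either dimension is zero is only one reading of the zero-dimension rule.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern; the commit only says page-tag lines are filtered via a regex.
PAGE_TAG_RE = re.compile(r"^\s*</?page\b[^>]*>\s*$")


@dataclass(frozen=True)
class Segment:
    segment_id: str
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1)
    file_index: int


def normalise_bbox(bbox, width, height):
    """Divide x-coords by width and y-coords by height.

    Assumption: if either dimension is zero the box passes through unchanged.
    """
    if width == 0 or height == 0:
        return tuple(bbox)
    x0, y0, x1, y1 = bbox
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def build_segments(pages, file_indexes):
    """Map dense IDs (p1_l0, p1_l1, ...) to segments.

    pages: list of {"width", "height", "lines": [(text, bbox), ...]}
    file_indexes: parallel list with one file index per page
    """
    segments = {}
    for page_no, (page, file_index) in enumerate(zip(pages, file_indexes), start=1):
        line_idx_in_page = 0
        for text, bbox in page["lines"]:
            if PAGE_TAG_RE.match(text):
                continue  # page tags get no ID, so they can never be cited
            seg_id = f"p{page_no}_l{line_idx_in_page}"
            segments[seg_id] = Segment(
                seg_id,
                text,
                normalise_bbox(bbox, page["width"], page["height"]),
                file_index,
            )
            line_idx_in_page += 1  # only real lines advance the counter
    return segments


def to_prompt_text(segments, context_texts=()):
    """Tagged body first, then untagged context texts after a blank line."""
    body = "\n".join(f"[{s.segment_id}] {s.text}" for s in segments.values())
    if not context_texts:
        return body
    return body + "\n\n" + "\n\n".join(context_texts)
```

Because the counter advances only for real lines, a page whose OCR output
interleaves `<page …>` tags with text still produces consecutive IDs, and the
prompt text depends only on insertion order, so the same input always yields
the same string.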

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:53:46 +02:00
integration fix(ci): create empty tests/integration so pytest doesn't error on missing dir 2026-04-18 10:39:26 +02:00
unit feat(segmentation): SegmentIndex + prompt-text formatter (spec §9.1) 2026-04-18 10:53:46 +02:00
__init__.py feat(scaffold): project skeleton with uv + pytest + forgejo CI 2026-04-18 10:36:43 +02:00
conftest.py feat(scaffold): project skeleton with uv + pytest + forgejo CI 2026-04-18 10:36:43 +02:00