infoxtractor/src/ix/segmentation/__init__.py at b2ff27c1ca35674350ce5051eae50bc322ec0dbb - goldstein/infoxtractor

Dirk Riemann 1321d57354

feat(segmentation): SegmentIndex + prompt-text formatter (spec §9.1)

Builds the ID <-> on-page-anchor map used by both the GenAIStep (to emit the
segment-tagged user message) and the provenance mapper (to resolve LLM-cited
IDs back to bbox/text/file_index).

Design notes:

- `build()` is a classmethod so the pipeline constructs the index in one
  place (OCRStep) and passes the constructed instance along in the internal
  context. No mutable global state; tests build indexes inline from fake
  OCR fixtures.

- Per-page metadata (file_index) arrives via a parallel `list[PageMetadata]`
  rather than being smuggled into OCRResult. Keeps segmentation decoupled
  from ingestion — the OCR engine legitimately doesn't know which file a
  page came from.

- Page-tag lines (`<page …>` / `</page>`) are filtered via a regex so the
  LLM can never cite them as provenance. `line_idx_in_page` increments only
  for real lines so the IDs stay dense (p1_l0, p1_l1, ...).

- Bounding-box normalisation divides x-coords by page width, y-coords by
  page height. Zero dimensions (defensive) pass through unchanged.

- `to_prompt_text(context_texts=[...])` appends paperless-style texts
  untagged, separated from the tagged body by a blank line (spec §7.2b).
  Deterministic for prompt caching.
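
The design notes above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `segment_index.py`: the `Segment` record, the helper names, the page-tag regex, the `[p1_l0] text` tag format, and the per-axis handling of zero dimensions are all hypothetical and chosen just to show dense ID assignment, bbox normalisation, and the untagged context suffix.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Assumed pattern for page-tag lines; the real regex is not shown in the commit.
PAGE_TAG_RE = re.compile(r"^\s*</?page\b[^>]*>\s*$")


@dataclass(frozen=True)
class Segment:
    """Hypothetical record for one citable line."""

    segment_id: str
    text: str
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1)
    file_index: int


def normalise_bbox(bbox, width, height):
    """Divide x-coords by page width and y-coords by page height.

    Assumption: a zero dimension leaves that axis unchanged.
    """
    x0, y0, x1, y1 = bbox
    if width:
        x0, x1 = x0 / width, x1 / width
    if height:
        y0, y1 = y0 / height, y1 / height
    return (x0, y0, x1, y1)


def build_segments(page_no, lines, width, height, file_index):
    """Assign dense IDs (p<page>_l<n>) to real lines, skipping page tags."""
    segments = []
    line_idx_in_page = 0
    for text, bbox in lines:
        if PAGE_TAG_RE.match(text):
            continue  # page tags are not citable and do not consume an index
        segment_id = f"p{page_no}_l{line_idx_in_page}"
        norm = normalise_bbox(bbox, width, height)
        segments.append(Segment(segment_id, text, norm, file_index))
        line_idx_in_page += 1
    return segments


def to_prompt_text(segments, context_texts=None):
    """Tagged body, then a blank line, then untagged context texts."""
    body = "\n".join(f"[{s.segment_id}] {s.text}" for s in segments)
    if context_texts:
        body += "\n\n" + "\n".join(context_texts)
    return body
```

Because the output depends only on the input order of lines and context texts, the same document yields the same prompt string on every run, which is what makes it usable as a cache key prefix.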

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 10:53:46 +02:00

7 lines

225 B

Python

 """Segment-index module: maps short IDs (``p1_l0``) to on-page anchors."""
 from __future__ import annotations
 from ix.segmentation.segment_index import PageMetadata, SegmentIndex
 __all__ = ["PageMetadata", "SegmentIndex"]