Builds the ID <-> on-page-anchor map used by both the GenAIStep (to emit the
segment-tagged user message) and the provenance mapper (to resolve LLM-cited
IDs back to bbox/text/file_index).
Design notes:
- `build()` is a classmethod so the pipeline constructs the index in one
place (OCRStep) and passes the constructed instance along in the internal
context. No mutable global state; tests build indexes inline from fake
OCR fixtures.
- Per-page metadata (file_index) arrives via a parallel `list[PageMetadata]`
rather than being smuggled into OCRResult. Keeps segmentation decoupled
from ingestion — the OCR engine legitimately doesn't know which file a
page came from.
- Page-tag lines (`<page …>` / `</page>`) are filtered via a regex so the
LLM can never cite them as provenance. `line_idx_in_page` increments only
for real lines so the IDs stay dense (p1_l0, p1_l1, ...).
- Bounding-box normalisation divides x-coords by page width and y-coords by
page height. When a page dimension is zero, coordinates pass through
unchanged (defensive guard against division by zero).
- `to_prompt_text(context_texts=[...])` appends paperless-style texts
untagged, separated from the tagged body by a blank line (spec §7.2b).
The output is deterministic, so prompt caching stays effective.
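
Illustrative sketch of the page-tag filtering, dense ID assignment, and
bounding-box normalisation described in the notes above. This is not the code
in this commit: `Segment`, `PAGE_TAG_RE`, `normalise_bbox`, and `segment_page`
are hypothetical names, the regex is a guess at the page-tag shape, and the
zero-dimension guard is assumed to apply per axis.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: a line that is only `<page ...>` or `</page>`.
PAGE_TAG_RE = re.compile(r"^\s*</?page\b[^>]*>\s*$")


@dataclass(frozen=True)
class Segment:
    """Hypothetical record for one citable line."""
    segment_id: str
    text: str
    bbox: tuple  # (x0, y0, x1, y1), normalised where possible


def normalise_bbox(bbox, page_width, page_height):
    """Divide x-coords by page width and y-coords by page height.

    Assumption: a zero (or negative) dimension leaves that axis unchanged.
    """
    x0, y0, x1, y1 = bbox
    if page_width > 0:
        x0, x1 = x0 / page_width, x1 / page_width
    if page_height > 0:
        y0, y1 = y0 / page_height, y1 / page_height
    return (x0, y0, x1, y1)


def segment_page(page_number, lines, page_width, page_height):
    """Turn (text, bbox) pairs into segments, skipping page-tag lines."""
    segments = []
    line_idx_in_page = 0  # advances only for real lines, so IDs stay dense
    for text, bbox in lines:
        if PAGE_TAG_RE.match(text):
            continue  # page tags never get an ID, so they cannot be cited
        segments.append(
            Segment(
                segment_id=f"p{page_number}_l{line_idx_in_page}",
                text=text,
                bbox=normalise_bbox(bbox, page_width, page_height),
            )
        )
        line_idx_in_page += 1
    return segments
```

Because the counter is only bumped after a real line is appended, a page whose
OCR output interleaves tag lines with text still yields p1_l0, p1_l1, ... with
no gaps.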
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>