feat(pipeline): OCRStep (spec §6.2) #13

Merged
goldstein merged 1 commit from feat/step-ocr into main 2026-04-18 09:16:04 +00:00

1 commit

Author SHA1 Message Date
81054baa06 feat(pipeline): OCRStep — run OCR + page tags + SegmentIndex (spec §6.2)
All checks were successful
tests / test (push) Successful in 1m11s
tests / test (pull_request) Successful in 1m13s
Runs after SetupStep. Dispatches the flat page list to the injected
OCRClient, writes the raw OCRResult onto response.ocr_result, injects
<page file="..." number="..."> open/close tag lines around each page's
content, and builds a SegmentIndex over the non-tag lines when
provenance is on.

Validate follows the spec triad rule:
- include_geometries/include_ocr_text/ocr_only + no files -> IX_000_004
- no files -> False (skip)
- files + (use_ocr or triad) -> True

9 unit tests in tests/unit/test_ocr_step.py cover all three validate
branches, OCRResult written, page tags injected (format + file_index),
SegmentIndex built iff provenance on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:15:46 +02:00