feat(pipeline): OCRStep (spec §6.2) #13

Merged
goldstein merged 1 commit from feat/step-ocr into main 2026-04-18 09:16:04 +00:00

Chunk 2, Task 2.5.

OCR step: runs the OCRClient, injects page tags, builds SegmentIndex.

Tests

9 new tests (157 total). `uv run pytest tests/unit -q` -> 157 passed. `uv run ruff check src tests` -> clean.

Merge gate

The Forgejo Actions trigger bug is still in effect, so the merge gate is the local test run plus ruff.

goldstein added 1 commit 2026-04-18 09:15:56 +00:00
feat(pipeline): OCRStep — run OCR + page tags + SegmentIndex (spec §6.2)
All checks were successful
tests / test (push) Successful in 1m11s
tests / test (pull_request) Successful in 1m13s
81054baa06
Runs after SetupStep. Dispatches the flat page list to the injected
OCRClient, writes the raw OCRResult onto response.ocr_result, injects
<page file="..." number="..."> open/close tag lines around each page's
content, and builds a SegmentIndex over the non-tag lines when
provenance is on.
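
As a rough illustration of the tagging and indexing described above, here is a minimal sketch. It is not the code from this commit: the `Page` shape, the `tag_pages` name, the `</page>` closing form, and the segment id scheme (`s0`, `s1`, ...) are all assumptions made for the example, and the real SegmentIndex may store different data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Page:
    # Hypothetical page shape; the real OCRResult model is not shown in this PR.
    file_index: int
    number: int
    lines: list[str] = field(default_factory=list)


def tag_pages(
    pages: list[Page], provenance: bool
) -> tuple[list[str], Optional[dict[str, int]]]:
    """Wrap each page's lines in tag lines; index non-tag lines if provenance is on."""
    out: list[str] = []
    # The index is only built when provenance is requested.
    index: Optional[dict[str, int]] = {} if provenance else None
    for page in pages:
        out.append(f'<page file="{page.file_index}" number="{page.number}">')
        for text in page.lines:
            if index is not None:
                # Map an illustrative segment id to the line's position in `out`.
                index[f"s{len(index)}"] = len(out)
            out.append(text)
        out.append("</page>")  # closing-tag form is an assumption
    return out, index
```

Indexing while the tags are written avoids having to recognise tag lines afterwards, which could misfire if OCR text itself began with `<page `.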

Validate follows the spec triad rule:
- include_geometries/include_ocr_text/ocr_only + no files -> IX_000_004
- no files -> False (skip)
- files + (use_ocr or triad) -> True
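
The three branches above can be sketched roughly as follows. This is an assumption-laden illustration rather than the step's actual code: `IXError`, `OCROptions`, and the `validate` signature are hypothetical, the sketch assumes IX_000_004 is surfaced as a raised exception, and the result for files with neither `use_ocr` nor a triad flag (False here) is not stated in the commit message.

```python
from dataclasses import dataclass


class IXError(Exception):
    """Hypothetical error type carrying a spec error code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


@dataclass
class OCROptions:
    # Flag names come from the commit message; the container is hypothetical.
    use_ocr: bool = False
    include_geometries: bool = False
    include_ocr_text: bool = False
    ocr_only: bool = False


def validate(options: OCROptions, files: list[str]) -> bool:
    """Return True if the OCR step should run, False to skip it."""
    triad = (
        options.include_geometries
        or options.include_ocr_text
        or options.ocr_only
    )
    if not files:
        if triad:
            # A triad flag without files is an error (assumed to be raised).
            raise IXError("IX_000_004")
        return False  # nothing to OCR, skip the step
    # Files present: run when OCR or any triad flag is requested.
    # Returning False otherwise is an assumption of this sketch.
    return options.use_ocr or triad
```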

9 unit tests in tests/unit/test_ocr_step.py cover all three validate
branches, OCRResult written, page tags injected (format + file_index),
SegmentIndex built iff provenance on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
goldstein merged commit acb2d55ce3 into main 2026-04-18 09:16:04 +00:00
Reference: goldstein/infoxtractor#13