infoxtractor/tests
Dirk Riemann 81054baa06
feat(pipeline): OCRStep — run OCR + page tags + SegmentIndex (spec §6.2)
Runs after SetupStep. Dispatches the flat page list to the injected
OCRClient, writes the raw OCRResult onto response.ocr_result, injects
<page file="..." number="..."> open/close tag lines around each page's
content, and builds a SegmentIndex over the non-tag lines when
provenance is on.
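
A minimal sketch of the tagging and indexing described above, not the
project's actual code. The Page class, the tag_pages function, the
plain </page> closing tag, the use of the file index as the file
attribute, and the shape of the index entries are all assumptions made
for illustration.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for one OCR'd page (not the project's model)."""
    file_index: int
    number: int
    lines: list[str]


def tag_pages(pages, provenance=False):
    """Wrap each page's lines in open/close tag lines.

    Returns (lines, segment_index). The index is built only when
    provenance is on and covers content lines only, never tag lines.
    """
    lines = []
    segment_index = {} if provenance else None
    for page in pages:
        # Opening tag line; the attribute values are assumed formats.
        lines.append(f'<page file="{page.file_index}" number="{page.number}">')
        for text in page.lines:
            if segment_index is not None:
                # Assumed entry shape: (file index, page number, output position).
                segment_index[len(segment_index)] = (
                    page.file_index,
                    page.number,
                    len(lines),
                )
            lines.append(text)
        # Closing tag line (assumed to be a bare </page>).
        lines.append("</page>")
    return lines, segment_index


pages = [
    Page(file_index=0, number=1, lines=["Invoice 42", "Total: 10 EUR"]),
    Page(file_index=0, number=2, lines=["Thank you"]),
]
tagged, index = tag_pages(pages, provenance=True)
```

Recording index entries while the lines are emitted avoids having to
recognise tag lines by string matching afterwards, which could misfire
if OCR text itself began with "<page ".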

validate() follows the spec's triad rule, checked in this order:
- any of include_geometries/include_ocr_text/ocr_only set + no files -> IX_000_004
- otherwise, no files -> False (step is skipped)
- files + (use_ocr or any triad flag) -> True
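
The three branches can be sketched as below. This is an illustration
under assumptions, not the code from this commit: the Options and
IXError classes and the free-standing validate function are
hypothetical, and so is reporting IX_000_004 by raising an exception.
The commit message also does not say what happens when files are
present but neither use_ocr nor a triad flag is set; the sketch
returns False in that case.

```python
from dataclasses import dataclass


class IXError(Exception):
    """Hypothetical error type carrying an IX_* code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


@dataclass
class Options:
    """Hypothetical container for the flags named in the commit message."""
    use_ocr: bool = False
    include_geometries: bool = False
    include_ocr_text: bool = False
    ocr_only: bool = False


def validate(files, opts):
    """Return True if the OCR step should run, False to skip it."""
    triad = opts.include_geometries or opts.include_ocr_text or opts.ocr_only
    if not files:
        if triad:
            # OCR output was requested but there is nothing to run OCR on.
            raise IXError("IX_000_004")
        return False
    # Assumption: with files but no OCR-related flag, the step is skipped.
    return opts.use_ocr or triad
```

Usage: validate(["a.pdf"], Options(use_ocr=True)) returns True, while
validate([], Options()) returns False and the step is skipped.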

Nine unit tests in tests/unit/test_ocr_step.py cover all three validate
branches, the OCRResult being written, page tag injection (format +
file_index), and the SegmentIndex being built iff provenance is on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:15:46 +02:00
integration fix(ci): create empty tests/integration so pytest doesn't error on missing dir 2026-04-18 10:39:26 +02:00
unit feat(pipeline): OCRStep — run OCR + page tags + SegmentIndex (spec §6.2) 2026-04-18 11:15:46 +02:00
__init__.py feat(scaffold): project skeleton with uv + pytest + forgejo CI 2026-04-18 10:36:43 +02:00
conftest.py feat(scaffold): project skeleton with uv + pytest + forgejo CI 2026-04-18 10:36:43 +02:00