Runs Surya's detection + recognition over PIL images rendered from each
Page's source file (PDFs via PyMuPDF, images via Pillow). Lazy warm_up
so FastAPI lifespan start stays predictable. Deferred Surya/torch
imports keep the base install slim — the heavy deps stay under [ocr].
Extends OCRClient Protocol with optional files + page_metadata kwargs
so the engine can resolve each page back to its on-disk source; Fake
accepts-and-ignores to keep hermetic tests unchanged.
selfcheck() runs the predictors on a 1x1 PIL image — wired into /healthz
by Task 4.3.
Tests: 6 hermetic unit tests (Surya predictors mocked, no model
download); 2 live tests gated on IX_TEST_OLLAMA=1 (never run in CI).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the two Protocol-based client contracts the pipeline steps depend on,
plus test-oriented fakes. Real engines (Surya, Ollama) land in Chunk 4.
- ix.ocr.client.OCRClient — runtime_checkable Protocol with async ocr().
- ix.genai.client.GenAIClient — runtime_checkable Protocol with async
invoke(); GenAIInvocationResult + GenAIUsage dataclasses carry the
parsed model, token usage, and model name.
- FakeOCRClient / FakeGenAIClient: return canned results; both expose a
raise_on_call hook for error-path tests.
8 unit tests across tests/unit/test_ocr_fake.py + test_genai_fake.py
confirm protocol conformance, canned-return behaviour, usage/model-name
defaults, and raise_on_call propagation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>