Runs Surya's detection + recognition over PIL images rendered from each Page's source file (PDFs via PyMuPDF, images via Pillow). Lazy warm_up so FastAPI lifespan start stays predictable. Deferred Surya/torch imports keep the base install slim; the heavy deps stay under [ocr].

Extends OCRClient Protocol with optional files + page_metadata kwargs so the engine can resolve each page back to its on-disk source; Fake accepts-and-ignores to keep hermetic tests unchanged.

selfcheck() runs the predictors on a 1x1 PIL image; wired into /healthz by Task 4.3.

Tests: 6 hermetic unit tests (Surya predictors mocked, no model download); 2 live tests gated on IX_TEST_OLLAMA=1 (never run in CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
50 lines
1.6 KiB
Python
"""OCRClient Protocol (spec §6.2).

Structural typing: any object with an async ``ocr(pages) -> OCRResult``
method satisfies the Protocol. :class:`~ix.pipeline.ocr_step.OCRStep`
depends on the Protocol, not a concrete class, so swapping engines
(``FakeOCRClient`` in tests, ``SuryaOCRClient`` in prod) stays a wiring
change at the app factory.

Per-page source location (``files`` + ``page_metadata``) flows in as
optional kwargs: fakes ignore them; the real
:class:`~ix.ocr.surya_client.SuryaOCRClient` uses them to render each
page's pixels back from disk. Keeping these optional lets unit tests stay
pages-only while production wiring (Task 4.3) plumbs through the real
filesystem handles.
"""

from __future__ import annotations

from pathlib import Path
from typing import Any, Protocol, runtime_checkable

from ix.contracts import OCRResult, Page


@runtime_checkable
class OCRClient(Protocol):
    """Async OCR backend.

    Implementations receive the flat page list the pipeline built in
    :class:`~ix.pipeline.setup_step.SetupStep` and return an
    :class:`~ix.contracts.OCRResult` with one :class:`~ix.contracts.Page`
    per input page (in the same order).
    """

    async def ocr(
        self,
        pages: list[Page],
        *,
        files: list[tuple[Path, str]] | None = None,
        page_metadata: list[Any] | None = None,
    ) -> OCRResult:
        """Run OCR over the input pages; return the structured result.

        ``files`` and ``page_metadata`` are optional for hermetic tests;
        real engines that need to re-render from disk read them.
        """
        ...


__all__ = ["OCRClient"]
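
The structural-typing claim in the module docstring can be illustrated with a minimal, self-contained sketch. It does not import `ix.contracts`; the `Page` and `OCRResult` dataclasses below are hypothetical stand-ins (their real fields are not shown in this file), and `FakeOCRClient` and `NotAClient` here are illustrative classes, not the project's test fake. The Protocol is restated with the same `ocr` signature so the example runs on its own.

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Optional, Protocol, runtime_checkable


# Hypothetical stand-ins for ix.contracts.Page / OCRResult (assumed shapes).
@dataclass
class Page:
    text: str = ""


@dataclass
class OCRResult:
    pages: list[Page] = field(default_factory=list)


@runtime_checkable
class OCRClient(Protocol):
    # Same signature as the Protocol in this module, restated for a runnable sketch.
    async def ocr(
        self,
        pages: list[Page],
        *,
        files: Optional[list[tuple[Path, str]]] = None,
        page_metadata: Optional[list[Any]] = None,
    ) -> OCRResult: ...


class FakeOCRClient:
    """Illustrative fake: accepts the optional kwargs and ignores them."""

    async def ocr(self, pages, *, files=None, page_metadata=None):
        # One output page per input page, in the same order.
        return OCRResult(pages=list(pages))


class NotAClient:
    """Has no ``ocr`` attribute, so it fails the structural check."""


fake = FakeOCRClient()
result = asyncio.run(fake.ocr([Page("a"), Page("b")]))

# No inheritance involved: the check passes because an ``ocr`` attribute exists.
is_client = isinstance(fake, OCRClient)
is_not_client = isinstance(NotAClient(), OCRClient)
```

Note that `isinstance` against a `runtime_checkable` Protocol only verifies that the named members exist; it does not check the signature or that `ocr` is a coroutine function, so a static type checker remains the stronger guarantee.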