feat(ocr): SuryaOCRClient real OCR backend (spec 6.2) #25

Merged
goldstein merged 1 commit from feat/surya-client into main 2026-04-18 10:04:42 +00:00

Real OCRClient implementation backed by surya-ocr.

Summary

  • SuryaOCRClient with lazy warm_up (predictors load on first call)
  • PDFs rendered via PyMuPDF, images via Pillow; always an independent OCR pass so cross-check vs Tesseract stays meaningful
  • Device defaults to Surya's selection (CUDA on server, MPS on Mac)
  • Extended OCRClient Protocol with optional files + page_metadata kwargs; Fake accepts-and-ignores
  • selfcheck() runs predictors on a 1x1 image for /healthz
  • Deferred Surya imports keep base install slim (heavy deps under [ocr])
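
The lazy warm-up and deferred-import items above can be illustrated with a minimal sketch. It is not the code from this PR: the class name `LazyPredictors`, the `loader` injection point, and the locking are assumptions made for illustration, and the default loader only imports the top-level `surya` package without building any predictors.

```python
import importlib
import threading
from typing import Any, Callable, Optional


class LazyPredictors:
    """Hypothetical holder that loads heavy predictors on first use."""

    def __init__(self, loader: Optional[Callable[[], Any]] = None) -> None:
        # `loader` is an assumed injection point so tests can skip real models.
        self._loader = loader or self._default_loader
        self._predictors: Any = None
        self._lock = threading.Lock()

    @staticmethod
    def _default_loader() -> Any:
        # Deferred import: the optional dependency is touched only here,
        # so importing this module works without the [ocr] extra installed.
        # How predictors are built from the package is not shown in the PR.
        return importlib.import_module("surya")

    def warm_up(self) -> Any:
        # Double-checked locking so concurrent first calls load only once.
        if self._predictors is None:
            with self._lock:
                if self._predictors is None:
                    self._predictors = self._loader()
        return self._predictors
```

Constructing `LazyPredictors()` is cheap under this design; the cost is paid on the first `warm_up()` call and later calls reuse the cached object.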

Test plan

  • 6 hermetic unit tests (Surya predictors mocked, no model download)
  • 2 live tests gated on IX_TEST_OLLAMA=1 (never run in CI)
  • Full unit + integration suite green (225 passed)
  • Ruff check clean
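
One way to gate live tests on an environment variable is sketched below, using only the standard library. The helper `live_tests_enabled` and the test class are hypothetical; the PR does not show its test code, and a pytest-based suite would more likely use `pytest.mark.skipif` for the same effect.

```python
import os
import unittest
from typing import Mapping


def live_tests_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    # Live tests run only when the variable is set to exactly "1".
    return environ.get("IX_TEST_OLLAMA") == "1"


@unittest.skipUnless(live_tests_enabled(), "set IX_TEST_OLLAMA=1 to run live tests")
class LiveOCRTest(unittest.TestCase):
    def test_live_backend(self) -> None:
        # Hypothetical body: a real live test would call the actual client here.
        self.assertTrue(live_tests_enabled())
```

Because CI does not set the variable, the whole class is reported as skipped there and no model download is attempted.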
goldstein added 1 commit 2026-04-18 10:04:37 +00:00
feat(ocr): SuryaOCRClient — real OCR backend (spec §6.2)
All checks were successful
tests / test (push) Successful in 1m14s
tests / test (pull_request) Successful in 1m14s
322f6b2b1b
Runs Surya's detection + recognition over PIL images rendered from each
Page's source file (PDFs via PyMuPDF, images via Pillow). Lazy warm_up
so FastAPI lifespan start stays predictable. Deferred Surya/torch
imports keep the base install slim — the heavy deps stay under [ocr].

Extends OCRClient Protocol with optional files + page_metadata kwargs
so the engine can resolve each page back to its on-disk source; Fake
accepts-and-ignores to keep hermetic tests unchanged.
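
A rough sketch of what extending a Protocol with optional keyword arguments can look like follows. The method name `ocr`, the argument types, and the return type are assumptions, since the PR text names only the `files` and `page_metadata` kwargs and the accept-and-ignore behavior of the fake.

```python
from typing import Any, Mapping, Optional, Protocol, Sequence


class OCRClient(Protocol):
    # Hypothetical signature; only `files` and `page_metadata` come from the PR.
    def ocr(
        self,
        pages: Sequence[Any],
        *,
        files: Optional[Sequence[str]] = None,
        page_metadata: Optional[Sequence[Mapping[str, Any]]] = None,
    ) -> list[str]: ...


class FakeOCRClient:
    """Test double that returns canned text regardless of its inputs."""

    def __init__(self, canned: list[str]) -> None:
        self._canned = canned

    def ocr(
        self,
        pages: Sequence[Any],
        *,
        files: Optional[Sequence[str]] = None,
        page_metadata: Optional[Sequence[Mapping[str, Any]]] = None,
    ) -> list[str]:
        # Accept-and-ignore: the new kwargs are part of the signature,
        # but the fake never reads them, so existing tests behave as before.
        return list(self._canned)
```

Keeping both kwargs optional with a `None` default means callers written against the old signature keep working without changes.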

selfcheck() runs the predictors on a 1x1 PIL image — wired into /healthz
by Task 4.3.
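
A self-check of this kind could be sketched as below. This is an illustration under assumptions, not the method from this PR: the free function, the `run_predictors` callable, the optional `image` parameter, and the boolean return value are all hypothetical. Only `Image.new` is a real Pillow call.

```python
from typing import Any, Callable, Optional


def selfcheck(run_predictors: Callable[[Any], Any], image: Optional[Any] = None) -> bool:
    """Return True if the predictors run end to end on a tiny image."""
    if image is None:
        # Deferred import so Pillow is needed only when no image is supplied.
        from PIL import Image

        image = Image.new("RGB", (1, 1), "white")
    try:
        run_predictors(image)
    except Exception:
        # Any failure is reported as unhealthy instead of raising,
        # so a health endpoint can turn it into a status response.
        return False
    return True
```

A 1x1 image keeps the check cheap while still exercising model loading and a full forward pass.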

Tests: 6 hermetic unit tests (Surya predictors mocked, no model
download); 2 live tests gated on IX_TEST_OLLAMA=1 (never run in CI).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
goldstein merged commit b737ed7b21 into main 2026-04-18 10:04:42 +00:00
Reference: goldstein/infoxtractor#25