Dirk Riemann 124403252d Initial design: on-prem LLM extraction microservice MVP
Establishes ix as an async, on-prem, LLM-powered structured extraction
microservice. Full reference spec stays in docs/spec-core-pipeline.md;
MVP spec (strict subset — Ollama only, Surya OCR, REST + Postgres-queue
transports in parallel, in-repo use cases, provenance-based reliability
signals) lives at docs/superpowers/specs/2026-04-18-ix-mvp-design.md.

First use case: bank_statement_header (feeds mammon's needs_parser flow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:23:17 +02:00


InfoXtractor (ix) MVP — Design

Date: 2026-04-18
Reference: docs/spec-core-pipeline.md (full, aspirational spec — MVP is a strict subset)
Status: Design approved (sections 1–8 walked through and accepted 2026-04-18)

0. One-paragraph summary

ix is an on-prem, async, LLM-powered microservice that extracts structured JSON from documents (PDFs, images, text) given a named use case (a Pydantic schema + system prompt). It returns the extracted fields together with per-field provenance (OCR segment IDs, bounding boxes, extracted-value agreement flags) that let calling services decide how much to trust each value. The MVP ships one use case (bank_statement_header), one OCR engine (Surya, pluggable), one LLM backend (Ollama, pluggable), and two transports in parallel (REST with optional webhook callback + a Postgres queue). Cloud services are explicitly forbidden. The first consumer is mammon, which uses ix as a fallback for needs_parser documents.

1. Guiding principles

  • On-prem always. No OpenAI, Anthropic, Azure (DI/CV/OpenAI), AWS (Bedrock/Textract), Google Document AI, Mistral, etc. LLM = Ollama (:11434). OCR = local engines only. Secrets never leave the home server. The spec's cloud references are examples to replace, not inherit.
  • Grounded extraction, not DB truth. ix returns best-effort fields with segment citations, provenance verification, and cross-OCR agreement flags. ix does not claim DB-grade truth. The reliability decision (trust / stage for review / reject) belongs to the caller.
  • Transport-agnostic pipeline core. The pipeline (RequestIX → ResponseIX) knows nothing about HTTP, queues, or databases. Adapters (REST, Postgres queue) run alongside the core; both converge on one shared job store.
  • YAGNI for all spec features the MVP doesn't need. Kafka, Config Server, Azure/AWS clients, vision, word-level provenance, reasoning-effort routing, Prometheus/OTEL exporters: deferred. Architecture leaves the interfaces so they can be added without touching the pipeline core.

2. Architecture

┌──────────────────────────────────────────────────────────────────┐
│  infoxtractor container (Docker on 192.168.68.42, port 8994)     │
│                                                                  │
│   ┌──────────────────┐      ┌──────────────────────────┐         │
│   │ rest_adapter     │      │ pg_queue_adapter         │         │
│   │ (FastAPI)        │      │ (asyncio worker)         │         │
│   │  POST /jobs      │      │  LISTEN ix_jobs_new +    │         │
│   │  GET  /jobs/{id} │      │  SELECT ... FOR UPDATE   │         │
│   │  + callback_url  │      │        SKIP LOCKED       │         │
│   └────────┬─────────┘      └────────┬─────────────────┘         │
│            │                         │                           │
│            └──────────┬──────────────┘                           │
│                       ▼                                          │
│              ┌────────────────┐                                  │
│              │ ix_jobs table  │  ── postgis :5431, DB=infoxtractor│
│              └────────┬───────┘                                  │
│                       ▼                                          │
│         ┌─────────────────────────────┐                          │
│         │  Pipeline core (spec §3–§4) │                          │
│         │                             │                          │
│         │  SetupStep → OCRStep →      │                          │
│         │  GenAIStep → ReliabilityStep│                          │
│         │         → ResponseHandler   │                          │
│         │                             │                          │
│         │  Uses: OCRClient (Surya),   │                          │
│         │        GenAIClient (Ollama),│                          │
│         │        UseCaseRegistry      │                          │
│         └─────────────────────────────┘                          │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
            │                                ▲
            ▼                                │
   host.docker.internal:11434       mammon or any on-prem caller —
   Ollama (gpt-oss:20b default)     polls GET /jobs/{id} until done

Key shapes:

  • Spec's four steps + a new fifth: ReliabilityStep runs between GenAIStep and ResponseHandlerStep, computes per-field provenance_verified and text_agreement flags. Isolated so callers and tests can reason about reliability signals independently.
  • Single worker at MVP (PIPELINE_WORKER_CONCURRENCY=1). Ollama + Surya both want the GPU serially.
  • Two transports, one job store. REST is the primary; pg queue is scaffolded, uses the same table, same lifecycle.
  • Use case registry in-repo (ix/use_cases/__init__.py); adding a new use case = new module + one registry line.

3. Data contracts

Subset of spec §2 / §9.3. Dropped fields are no-ops under the MVP's feature set.

RequestIX

class RequestIX(BaseModel):
    use_case: str                    # registered name, e.g. "bank_statement_header"
    ix_client_id: str                # caller tag for logs/metrics, e.g. "mammon"
    request_id: str                  # caller's correlation id; echoed back
    ix_id: Optional[str]             # transport-assigned short hex id
    context: Context
    options: Options = Options()
    callback_url: Optional[str]      # optional webhook delivery (one-shot, no retry)

class Context(BaseModel):
    files: list[str] = []            # URLs or file:// paths
    texts: list[str] = []            # extra text (e.g. Paperless OCR output)

class Options(BaseModel):
    ocr: OCROptions = OCROptions()
    gen_ai: GenAIOptions = GenAIOptions()
    provenance: ProvenanceOptions = ProvenanceOptions()

class OCROptions(BaseModel):
    use_ocr: bool = True
    ocr_only: bool = False
    include_ocr_text: bool = False
    include_geometries: bool = False
    service: Literal["surya"] = "surya"   # kept so the adapter point is visible

class GenAIOptions(BaseModel):
    gen_ai_model_name: Optional[str] = None   # None → use-case default → IX_DEFAULT_MODEL

class ProvenanceOptions(BaseModel):
    include_provenance: bool = True           # default ON (reliability is the point)
    max_sources_per_field: int = 10

Dropped from spec (no-ops under MVP): OCROptions.computer_vision_scaling_factor, include_page_tags (always on), GenAIOptions.use_vision/vision_scaling_factor/vision_detail/reasoning_effort, ProvenanceOptions.granularity/include_bounding_boxes/source_type/min_confidence, RequestIX.version.
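For concreteness, a minimal submission body under these contracts might look like the sketch below. The file path, ids, and text snippet are made-up placeholders, not fixtures from this repo; omitted options fall back to the model defaults above.

```python
import json

# Illustrative POST /jobs body for the models above (placeholder values).
request_body = {
    "use_case": "bank_statement_header",
    "ix_client_id": "mammon",
    "request_id": "import-4711",                # caller's correlation id, echoed back
    "context": {
        "files": ["file:///tmp/statement.pdf"],
        "texts": ["Kontoauszug 3/2026 ..."],    # e.g. Paperless OCR output
    },
    "options": {"provenance": {"include_provenance": True}},
    "callback_url": None,                       # poll GET /jobs/{id} instead
}
payload = json.dumps(request_body)
```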

ResponseIX

Identical to spec §2.2 except FieldProvenance gains two fields:

class FieldProvenance(BaseModel):
    field_name: str
    field_path: str
    value: Any
    sources: list[ExtractionSource]
    confidence: Optional[float] = None           # reserved; always None in MVP
    provenance_verified: bool                    # NEW: cited segment actually contains value (normalized)
    text_agreement: Optional[bool]               # NEW: value appears in RequestIX.context.texts; None if no texts

quality_metrics gains two counters: verified_fields, text_agreement_fields.

Job envelope (in ix_jobs table; returned by REST)

class Job(BaseModel):
    job_id: UUID
    ix_id: str
    client_id: str
    request_id: str
    status: Literal["pending", "running", "done", "error"]
    request: RequestIX
    response: Optional[ResponseIX]
    callback_url: Optional[str]
    callback_status: Optional[Literal["pending", "delivered", "failed"]]
    attempts: int = 0
    created_at: datetime
    started_at: Optional[datetime]
    finished_at: Optional[datetime]

4. Job store

CREATE DATABASE infoxtractor;   -- on the existing postgis container

CREATE TABLE ix_jobs (
    job_id          UUID PRIMARY KEY,
    ix_id           TEXT NOT NULL,
    client_id       TEXT NOT NULL,
    request_id      TEXT NOT NULL,
    status          TEXT NOT NULL,
    request         JSONB NOT NULL,
    response        JSONB,
    callback_url    TEXT,
    callback_status TEXT,
    attempts        INT  NOT NULL DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    started_at      TIMESTAMPTZ,
    finished_at     TIMESTAMPTZ
);
CREATE INDEX ix_jobs_status_created  ON ix_jobs (status, created_at) WHERE status = 'pending';
CREATE INDEX ix_jobs_client_request  ON ix_jobs (client_id, request_id);
-- Postgres NOTIFY channel used by the pg_queue_adapter: 'ix_jobs_new'

Callers that prefer direct SQL (the pg_queue_adapter contract): insert a row with status='pending', then NOTIFY ix_jobs_new, '<job_id>'. The worker also falls back to a 10 s poll so a missed notify or ix restart doesn't strand a job.
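Spelled out as SQL, both sides of that contract could look like this sketch. The worker-side claim statement is an assumption consistent with the `SELECT ... FOR UPDATE SKIP LOCKED` shown in the architecture diagram; placeholders stay placeholders.

```sql
-- Submit (caller side): insert a pending row, then wake the worker.
INSERT INTO ix_jobs (job_id, ix_id, client_id, request_id, status, request)
VALUES ('<uuid>', '<hex>', 'mammon', 'import-4711', 'pending', '<request json>'::jsonb);
NOTIFY ix_jobs_new, '<job_id>';

-- Claim (worker side): one pending job at a time, safe under concurrent workers.
UPDATE ix_jobs
   SET status = 'running', started_at = now(), attempts = attempts + 1
 WHERE job_id = (SELECT job_id FROM ix_jobs
                  WHERE status = 'pending'
                  ORDER BY created_at
                  FOR UPDATE SKIP LOCKED
                  LIMIT 1)
RETURNING job_id;
```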

5. REST surface

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /jobs | Body = RequestIX (+ optional callback_url). → 201 {job_id, ix_id, status: "pending"}. Idempotent on (ix_client_id, request_id) — the same pair returns the existing job_id with 200. |
| GET | /jobs/{job_id} | → full Job. Source of truth regardless of submission path or callback outcome. |
| GET | /jobs?client_id=…&request_id=… | Lookup by correlation (caller idempotency helper). Returns the latest match or 404. |
| GET | /healthz | {ollama: ok/fail, postgres: ok/fail, ocr: ok/fail}. Used by the infrastructure monitoring dashboard. |
| GET | /metrics | Counters: jobs_pending, jobs_running, jobs_done_24h, jobs_error_24h, per-use-case avg seconds. Plain JSON, no Prometheus format for MVP. |

Callback delivery (when callback_url is set): one POST of the full Job body, 10 s timeout. 2xx → callback_status='delivered'. Anything else → 'failed'. No retry. Callers always have GET /jobs/{id} as the authoritative fallback.
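A stdlib sketch of that delivery rule (function name and shapes are ours; the real adapter presumably uses the app's async HTTP client):

```python
import json
import urllib.request

CALLBACK_TIMEOUT_SECONDS = 10  # IX_CALLBACK_TIMEOUT_SECONDS

def deliver_callback(callback_url: str, job_body: dict) -> str:
    """One POST of the full Job body; returns the resulting callback_status.

    Any non-2xx response or transport error maps to 'failed'. No retry —
    GET /jobs/{id} remains the authoritative fallback.
    """
    req = urllib.request.Request(
        callback_url,
        data=json.dumps(job_body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=CALLBACK_TIMEOUT_SECONDS):
            return "delivered"  # urlopen raises on non-2xx, so reaching here is success
    except Exception:
        return "failed"
```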

6. Pipeline steps

Interface per spec §3 (async validate + async process). Pipeline orchestration per spec §4 (first error aborts; each step wrapped in a Timer landing in Metadata.timings).

SetupStep

  • validate: request_ix non-null; context.files or context.texts non-empty.
  • process:
    • Copy request_ix.context.texts → response_ix.context.texts.
    • Download each URL in context.files to /tmp/ix/<ix_id>/ in parallel. MIME detection via python-magic. Supported: PDF, PNG, JPEG, TIFF. Unsupported → IX_000_005.
    • Load use case: request_cls, response_cls = REGISTRY[request_ix.use_case]. Store instances in response_ix.context.use_case_request / use_case_response. Echo use_case_request.use_case_name → response_ix.use_case_name.
    • Build flat response_ix.context.pages: one entry per PDF page (via PyMuPDF), one per image frame, one per text entry. Hard cap 100 pages/PDF → IX_000_006 on violation.
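The page-flattening plus cap check can be sketched hermetically; the entry shape and names below are assumptions (the real step walks PyMuPDF documents and image frames):

```python
MAX_PDF_PAGES = 100  # exceeded → IX_000_006

def build_pages(pdf_page_counts: list[int], n_texts: int) -> list[tuple[str, int, int]]:
    """Flatten files and texts into flat (kind, item_index, page_no) entries."""
    pages: list[tuple[str, int, int]] = []
    for item_index, count in enumerate(pdf_page_counts):
        if count > MAX_PDF_PAGES:
            raise ValueError("IX_000_006: PDF page-count cap exceeded")
        pages += [("pdf", item_index, n) for n in range(1, count + 1)]
    pages += [("text", i, 1) for i in range(n_texts)]   # one entry per text input
    return pages
```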

OCRStep

  • validate: returns True iff (use_ocr or ocr_only or include_geometries or include_ocr_text) and context.files. Otherwise False → step skipped (text-only requests).
  • process: ocr_result = await OCRClient.ocr(context.pages) → response_ix.ocr_result. Always inject <page file="{item_index}" number="{page_no}"> tags (simplifies grounding). If include_provenance: build SegmentIndex (line granularity, normalized bboxes 0-1) and store in context.segment_index.
  • OCRClient interface:
    class OCRClient(Protocol):
        async def ocr(self, pages: list[Page]) -> OCRResult: ...
    
    MVP implementation: SuryaOCRClient (GPU via surya-ocr PyPI package, CUDA on the RTX 3090).

GenAIStep

  • validate: ocr_only → skip (returns False). The use case must exist, and OCR text or context.texts must be non-empty (else IX_001_000).
  • process:
    • System prompt = use_case_request.system_prompt. If include_provenance: append spec §9.2 citation instruction verbatim.
    • User text: segment-tagged ([p1_l0] …) when provenance is on; plain concatenated OCR + texts otherwise.
    • Response schema: UseCaseResponse directly, or the dynamic ProvenanceWrappedResponse(result=..., segment_citations=...) per spec §7.2e when provenance is on.
    • Model: request_ix.options.gen_ai.gen_ai_model_name → use_case_request.default_model → IX_DEFAULT_MODEL.
    • Call GenAIClient.invoke(request_kwargs, response_schema); parsed model → ix_result.result, usage + model_name → ix_result.meta_data.
    • If provenance: call ProvenanceUtils.map_segment_refs_to_provenance(...) per spec §9.4, write response_ix.provenance.
  • GenAIClient interface:
    class GenAIClient(Protocol):
        async def invoke(self, request_kwargs: dict, response_schema: type[BaseModel]) -> GenAIInvocationResult: ...
    
    MVP implementation: OllamaClient → POST http://host.docker.internal:11434/api/chat with format = <JSON schema from Pydantic> (Ollama structured outputs).
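The request body that OllamaClient would assemble can be sketched as follows; this assumes Ollama's non-streaming /api/chat interface where format carries the Pydantic-derived JSON schema, and the helper name is ours:

```python
import json

def build_ollama_chat_body(model: str, system_prompt: str, user_text: str,
                           response_schema: dict) -> str:
    """Assemble a non-streaming /api/chat request body (sketch)."""
    return json.dumps({
        "model": model,                 # e.g. "gpt-oss:20b"
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "format": response_schema,      # JSON schema constrains the model output
        "stream": False,
    })

body = build_ollama_chat_body(
    "gpt-oss:20b",
    "You extract header metadata ...",
    "[p1_l0] Kontoauszug 3/2026 ...",   # segment-tagged user text (provenance on)
    {"type": "object", "properties": {"bank_name": {"type": "string"}}},
)
```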

ReliabilityStep (new; runs when include_provenance is True)

For each FieldProvenance in response_ix.provenance.fields:

  • provenance_verified: for each cited segment, compare text_snippet to the extracted value after normalization (see below). If any cited segment agrees → True. Else False.
  • text_agreement: if request_ix.context.texts is empty → None. Else run the same comparison against the concatenated texts → True / False.
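Folded into one function, the two flags amount to a normalized containment check; the snippet assumes a normalize() helper implementing the per-type rules and is a sketch, not the step's real signature:

```python
from typing import Callable, Optional

def reliability_flags(value: str,
                      cited_snippets: list[str],
                      texts: list[str],
                      normalize: Callable[[str], str]) -> tuple[bool, Optional[bool]]:
    """Compute (provenance_verified, text_agreement) for one extracted field."""
    needle = normalize(value)
    # provenance_verified: any cited segment contains the normalized value.
    verified = any(needle in normalize(snippet) for snippet in cited_snippets)
    # text_agreement: None without caller texts, else containment in the concat.
    agreement = None if not texts else needle in normalize(" ".join(texts))
    return verified, agreement
```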

Normalization rules (cheap, language-neutral, applied to both sides before the containment check):

  • Strings: Unicode NFKC, casefold, collapse whitespace, strip common punctuation.
  • Numbers (int, float, Decimal values): digits-and-sign only; strip currency symbols, thousands separators, decimal-separator variants (./,); require exact match to 2 decimal places for amounts.
  • Dates: parse to ISO YYYY-MM-DD; compare as strings. Accept common German / Swiss / US formats.
  • IBANs: uppercase, strip spaces.
  • Very short values (≤ 2 chars, or numeric |value| < 10): text_agreement skipped (returns None) — too noisy to be a useful signal.
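A stdlib sketch of the string and number rules (function names are ours; the date and IBAN variants are omitted):

```python
import re
import unicodedata

def normalize_string(value: str) -> str:
    """NFKC → casefold → strip common punctuation → collapse whitespace."""
    text = unicodedata.normalize("NFKC", value).casefold()
    text = re.sub(r"""[.,;:!?'"()\[\]]""", "", text)
    return re.sub(r"\s+", " ", text).strip()

def normalize_number(raw: str) -> str:
    """Digits and sign only; unify German/US separators onto a '.' decimal point."""
    text = re.sub(r"[^\d.,+-]", "", raw)            # drop currency symbols, spaces
    last = max(text.rfind("."), text.rfind(","))    # last separator = decimal point
    if last != -1:
        int_part = re.sub(r"[.,]", "", text[:last])  # remaining separators = thousands
        text = int_part + "." + text[last + 1:]
    return text
```

Note the heuristic: the last separator is taken as the decimal point, so a thousands-only value like "1.234" is ambiguous on its own — the two-decimal-places rule for amounts is what disambiguates it.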

This step mutates the provenance structure only; it never drops fields. The caller sees every extracted field plus the flags.

Writes quality_metrics.verified_fields and quality_metrics.text_agreement_fields summary counts.

ResponseHandlerStep

Per spec §8, unchanged. Attach flat OCR text when include_ocr_text; strip ocr_result.pages unless include_geometries; delete context before serialization.

7. Use case registry

ix/use_cases/
  __init__.py            # REGISTRY: dict[str, tuple[type[UseCaseRequest], type[UseCaseResponse]]]
  bank_statement_header.py

Adding a use case = new module exporting a Request(BaseModel) (use_case_name, default_model, system_prompt) and a UseCaseResponse(BaseModel), then one REGISTRY["<name>"] = (Request, UseCaseResponse) line.
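The registry pattern described above, sketched with plain classes standing in for the Pydantic models:

```python
# ix/use_cases/__init__.py — sketch; the real code registers Pydantic models.
REGISTRY: dict[str, tuple[type, type]] = {}

# A use-case module (e.g. bank_statement_header.py) exports these two classes …
class Request:
    use_case_name = "Bank Statement Header"
    default_model = "gpt-oss:20b"
    system_prompt = "You extract header metadata ..."

class UseCaseResponse:  # the extraction schema
    pass

# … and registration is the one line the pipeline relies on:
REGISTRY["bank_statement_header"] = (Request, UseCaseResponse)

# SetupStep's lookup:
request_cls, response_cls = REGISTRY["bank_statement_header"]
```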

First use case: bank_statement_header

class BankStatementHeader(BaseModel):
    bank_name: str
    account_iban: Optional[str]
    account_type: Optional[Literal["checking", "credit", "savings"]]
    currency: str
    statement_date: Optional[date]
    statement_period_start: Optional[date]
    statement_period_end: Optional[date]
    opening_balance: Optional[Decimal]
    closing_balance: Optional[Decimal]

class Request(BaseModel):
    use_case_name: str = "Bank Statement Header"
    default_model: str = "gpt-oss:20b"
    system_prompt: str = (
        "You extract header metadata from a single bank or credit-card statement. "
        "Return only facts that appear in the document; leave a field null if uncertain. "
        "Balances must use the document's numeric format (e.g. '1234.56' or '-123.45'); "
        "do not invent a currency symbol. Account type: 'checking' for current/Giro accounts, "
        "'credit' for credit-card statements, 'savings' otherwise. Always return the IBAN "
        "with spaces removed. Never fabricate a value to fill a required-looking field."
    )

Why these fields: each appears at most once per document (one cite per field → strong provenance_verified signal); all reconcile against something mammon already stores (IBAN → Account.iban, period → verified-range chain, closing_balance → next month's opening_balance and StatementBalance); schema is flat (no nested arrays where Ollama structured output tends to drift).

8. Errors and warnings

Error-code subset from spec §12.2 (reusing codes as-is where meaning is identical):

| Code | Trigger |
| --- | --- |
| IX_000_000 | request_ix is None |
| IX_000_002 | No context (neither files nor texts) |
| IX_000_004 | OCR required but no files provided |
| IX_000_005 | File MIME type not supported |
| IX_000_006 | PDF page-count cap exceeded |
| IX_001_000 | use_case empty, or extraction context (OCR + texts) empty after setup |
| IX_001_001 | Use case name not in REGISTRY |

Warnings (non-fatal, appended to response_ix.warning): empty OCR result, provenance requested with use_ocr=False, unknown model falling back to default.

9. Configuration (AppConfig via pydantic-settings)

| Key env var | Default | Meaning |
| --- | --- | --- |
| IX_POSTGRES_URL | postgresql+asyncpg://infoxtractor:…@host.docker.internal:5431/infoxtractor | Job store |
| IX_OLLAMA_URL | http://host.docker.internal:11434 | LLM backend |
| IX_DEFAULT_MODEL | gpt-oss:20b | Fallback model |
| IX_OCR_ENGINE | surya | Adapter selector (only value in MVP) |
| IX_TMP_DIR | /tmp/ix | Download scratch |
| IX_PIPELINE_WORKER_CONCURRENCY | 1 | Worker semaphore cap |
| IX_PIPELINE_REQUEST_TIMEOUT_SECONDS | 2700 | Per-job timeout (45 min) |
| IX_RENDER_MAX_PIXELS_PER_PAGE | 75000000 | Per-page render cap |
| IX_LOG_LEVEL | INFO | Log verbosity |
| IX_CALLBACK_TIMEOUT_SECONDS | 10 | Webhook POST timeout |

No Azure, OpenAI, or AWS variables — those paths do not exist in the codebase.

10. Observability (minimal)

  • Logs: JSON-structured via logging + custom formatter. Every line carries ix_id, client_id, request_id, use_case. Steps emit step_start / step_end events with elapsed ms.
  • Timings: every step's elapsed-seconds recorded in response_ix.metadata.timings (same shape as spec §2).
  • Traces: OpenTelemetry span scaffolding present, no exporter wired. Drop-in later.
  • Prometheus: deferred.

11. Deployment

  • Repo: goldstein/infoxtractor on Forgejo, plus server bare-repo remote with post-receive hook mirroring mammon.
  • Port 8994 (LAN-only via UFW; not exposed publicly — internal service).
  • Postgres: new infoxtractor database on existing postgis container.
  • Ollama reached via host.docker.internal:11434.
  • Monitoring label: infrastructure.web_url=http://192.168.68.42:8994.
  • Backup: backup.enable=true, backup.type=postgres, backup.name=infoxtractor.
  • Dockerfile: CUDA-enabled base (nvidia/cuda:12.4-runtime-ubuntu22.04 + Python 3.12) so Surya can use the 3090. CMD: alembic upgrade head && uvicorn ix.app:create_app --factory --host 0.0.0.0 --port 8994.

12. Testing strategy

Strict TDD — each unit is written test-first.

  1. Unit tests (fast, hermetic): every Step, SegmentIndex, provenance-verification normalizers, OCRClient contract, GenAIClient contract, error mapping. No DB, no Ollama, no network.
  2. Integration tests (DB + fakes): pipeline end-to-end with stub OCRClient (replays canned OCR results) and stub GenAIClient (replays canned LLM JSON). Covers step wiring + transports + job lifecycle + callback success/failure + pg queue notify. Run against a real postgres service container in Forgejo Actions (mammon CI pattern).
  3. E2E smoke against deployed app: scripts/e2e_smoke.py on the Mac calls POST http://192.168.68.42:8994/jobs with a redacted bank-statement fixture (tests/fixtures/dkb_giro_2026_03.pdf), polls GET /jobs/{id} until done, asserts:
    • status == "done"
    • provenance.fields["result.closing_balance"].provenance_verified is True
    • text_agreement is True when Paperless-style texts are submitted
    • Timings under 60 s
    The smoke test runs after every git push server main as the deploy gate. If it fails, the commit is reverted before merging the deploy PR.
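The polling half of the smoke script can be sketched transport-agnostically; fetch is injected so the loop itself stays testable without a live service (names are ours):

```python
import time
from typing import Callable

def poll_until_done(fetch: Callable[[], dict], timeout_s: float = 60,
                    interval_s: float = 2) -> dict:
    """Call fetch() (e.g. GET /jobs/{id}) until status is terminal or we time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch()
        if job["status"] in ("done", "error"):   # terminal job states
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        time.sleep(interval_s)
```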

13. Mammon integration (sketch — outside this spec's scope, owned by mammon)

Belongs in a mammon-side follow-up spec. Captured here only so readers of ix know the MVP's first consumer.

  • Paperless poller keeps current behavior for format-matched docs.
  • For needs_parser docs: submit to ix (use_case="bank_statement_header", files=[paperless_download_url], texts=[paperless_content]).
  • ix job id recorded on the Import row. A new poller on the mammon side checks GET /jobs/{id} until done.
  • Result is staged (new pending_headers table — not StatementBalance). A new "Investigate" panel surfaces staged headers with per-field provenance_verified + text_agreement flags.
  • User confirms → write to StatementBalance. Over time, when a deterministic parser is added for the bank, compare ix's past extractions against the deterministic output to measure ix accuracy.

14. Deferred from full spec (explicit)

  • Kafka transport (§15)
  • Config Server (§9.1 in full spec, §10 here): use cases are in-repo for MVP
  • Azure DI / Computer Vision OCR backends
  • OpenAI, Anthropic, AWS Bedrock GenAI backends
  • S3 adapter
  • use_vision + vision scaling/detail
  • Word-level provenance granularity
  • reasoning_effort parameter routing
  • Prometheus exporter (/metrics stays JSON for MVP)
  • OTEL gRPC exporter (spans present, no exporter)
  • Legacy aliases (prompt_template_base, kwargs_use_case)
  • Second-opinion multi-model ensembling
  • Schema version field
  • Per-request rate limiting

Every deferred item is additive: the OCRClient / GenAIClient / transport-adapter interfaces already leave the plug points, and the pipeline core is unaware of which implementation is in use.

15. Implementation workflow (habit reminder)

Every unit of work follows the cross-project habit:

  1. git checkout -b feat/<name>
  2. TDD: write failing test, write code, green, refactor
  3. Commit in small logical chunks; update AGENTS.md / README.md / docs/ in the same commit as the code
  4. git push forgejo feat/<name>
  5. Create PR via Forgejo API
  6. Wait for tests to pass
  7. Merge
  8. git push server main to deploy; run scripts/e2e_smoke.py against the live service

Never skip hooks, never force-push main, never bypass tests.