infoxtractor/docs/superpowers/specs/2026-04-18-ix-mvp-design.md
Dirk Riemann 5e007b138d Address spec review — auth, timeouts, lifecycle, error codes
- FileRef type added so callers (mammon/Paperless) can pass Authorization
  headers alongside URLs. context.files is now list[str | FileRef].
- Job lifecycle state machine pinned down, including worker-startup sweep
  for rows stuck in 'running' after a crash.
- Explicit IX_002_000 / IX_002_001 codes for Ollama unreachable and
  structured-output schema violations, with per-call timeout
  IX_GENAI_CALL_TIMEOUT_SECONDS distinct from the per-job timeout.
- IX_000_007 code for file-fetch failures; per-file size, connect, and
  read timeouts configurable via env.
- ReliabilityStep: Literal-typed fields and None values explicitly skipped
  from provenance verification (with reason); dates parse both sides
  before ISO comparison.
- /healthz semantics pinned down (CUDA + Surya loaded; Ollama reachable
  AND model available). /metrics window is last 24h.
- (client_id, request_id) is UNIQUE in ix_jobs, matching the idempotency
  claim.
- Deploy-failure workflow uses `git revert` forward commit, not
  force-push — aligned with AGENTS.md habits.
- Dockerfile / compose require --gpus all. Pre-deploy requires
  `ollama pull gpt-oss:20b`; /healthz verifies before deploy completes.
- CI clarified: Forgejo Actions runners are GPU-less and LAN-disconnected;
  all inference is stubbed there. Real-Ollama tests behind IX_TEST_OLLAMA=1.
- Fixture redaction stance: synthetic-template PDF committed; real
  redacted fixtures live out-of-repo.
- Deferred list picks up use_case URL/Base64, callback retries,
  multi-container workers. quality_metrics retains reference-spec counters
  plus the two new MVP ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 10:28:43 +02:00


# InfoXtractor (ix) MVP — Design
Date: 2026-04-18
Reference: `docs/spec-core-pipeline.md` (full, aspirational spec — MVP is a strict subset)
Status: Design approved (sections 18 walked through and accepted 2026-04-18)
## 0. One-paragraph summary
ix is an on-prem, async, LLM-powered microservice that extracts structured JSON from documents (PDFs, images, text) given a named *use case* (a Pydantic schema + system prompt). It returns the extracted fields together with per-field provenance (OCR segment IDs, bounding boxes, extracted-value agreement flags) that let calling services decide how much to trust each value. The MVP ships one use case (`bank_statement_header`), one OCR engine (Surya, pluggable), one LLM backend (Ollama, pluggable), and two transports in parallel (REST with optional webhook callback + a Postgres queue). Cloud services are explicitly forbidden. The first consumer is mammon, which uses ix as a fallback for `needs_parser` documents.
## 1. Guiding principles
- **On-prem always.** No OpenAI, Anthropic, Azure (DI/CV/OpenAI), AWS (Bedrock/Textract), Google Document AI, Mistral, etc. LLM = Ollama (:11434). OCR = local engines only. Secrets never leave the home server. The spec's cloud references are examples to replace, not inherit.
- **Grounded extraction, not DB truth.** ix returns best-effort fields with segment citations, provenance verification, and cross-OCR agreement flags. ix does *not* claim DB-grade truth. The reliability decision (trust / stage for review / reject) belongs to the caller.
- **Transport-agnostic pipeline core.** The pipeline (`RequestIX` → `ResponseIX`) knows nothing about HTTP, queues, or databases. Adapters (REST, Postgres queue) run alongside the core; both converge on one shared job store.
- **YAGNI for all spec features the MVP doesn't need.** Kafka, Config Server, Azure/AWS clients, vision, word-level provenance, reasoning-effort routing, Prometheus/OTEL exporters: deferred. Architecture leaves the interfaces so they can be added without touching the pipeline core.
## 2. Architecture
```
infoxtractor container (Docker on 192.168.68.42, port 8994)
───────────────────────────────────────────────────────────

  ┌──────────────────┐        ┌──────────────────────────┐
  │ rest_adapter     │        │ pg_queue_adapter         │
  │ (FastAPI)        │        │ (asyncio worker)         │
  │ POST /jobs       │        │ LISTEN ix_jobs_new +     │
  │ GET /jobs/{id}   │        │ SELECT ... FOR UPDATE    │
  │ + callback_url   │        │   SKIP LOCKED            │
  └────────┬─────────┘        └────────┬─────────────────┘
           │                           │
           └─────────────┬─────────────┘
                         ▼
                ┌────────────────┐
                │ ix_jobs table  │ ── postgis :5431, DB=infoxtractor
                └────────┬───────┘
                         ▼
        ┌─────────────────────────────┐
        │ Pipeline core (spec §3–§4)  │
        │                             │
        │ SetupStep → OCRStep →       │
        │ GenAIStep → ReliabilityStep │
        │ → ResponseHandler           │
        │                             │
        │ Uses: OCRClient (Surya),    │
        │       GenAIClient (Ollama), │
        │       UseCaseRegistry       │
        └─────────────────────────────┘
            │                       ▲
            ▼                       │
  host.docker.internal:11434     mammon or any on-prem caller —
  Ollama (gpt-oss:20b default)   polls GET /jobs/{id} until done
```
**Key shapes:**
- Spec's four steps + a new fifth: `ReliabilityStep` runs between `GenAIStep` and `ResponseHandlerStep`, computes per-field `provenance_verified` and `text_agreement` flags. Isolated so callers and tests can reason about reliability signals independently.
- Single worker at MVP (`PIPELINE_WORKER_CONCURRENCY=1`). Ollama + Surya both want the GPU serially.
- Two transports, one job store. REST is the primary; pg queue is scaffolded, uses the same table, same lifecycle.
- Use case registry in-repo (`ix/use_cases/__init__.py`); adding a new use case = new module + one registry line.
## 3. Data contracts
Subset of spec §2 / §9.3. Dropped fields are no-ops under the MVP's feature set.
### RequestIX
```python
class RequestIX(BaseModel):
    use_case: str                # registered name, e.g. "bank_statement_header"
    ix_client_id: str            # caller tag for logs/metrics, e.g. "mammon"
    request_id: str              # caller's correlation id; echoed back
    ix_id: Optional[str]         # caller MUST NOT set; transport assigns a 16-char hex id
    context: Context
    options: Options = Options()
    callback_url: Optional[str]  # optional webhook delivery (one-shot, no retry)

class Context(BaseModel):
    files: list[Union[str, FileRef]] = []  # URLs, file:// paths, or FileRef objects (for auth headers)
    texts: list[str] = []                  # extra text (e.g. Paperless OCR output)

class FileRef(BaseModel):
    """Used when a file URL requires auth headers (e.g. Paperless Token auth) or per-file overrides."""
    url: str                         # http(s):// or file://
    headers: dict[str, str] = {}     # e.g. {"Authorization": "Token …"}
    max_bytes: Optional[int] = None  # per-file override; defaults to IX_FILE_MAX_BYTES

class Options(BaseModel):
    ocr: OCROptions = OCROptions()
    gen_ai: GenAIOptions = GenAIOptions()
    provenance: ProvenanceOptions = ProvenanceOptions()

class OCROptions(BaseModel):
    use_ocr: bool = True
    ocr_only: bool = False
    include_ocr_text: bool = False
    include_geometries: bool = False
    service: Literal["surya"] = "surya"  # kept so the adapter point is visible

class GenAIOptions(BaseModel):
    gen_ai_model_name: Optional[str] = None  # None → use-case default → IX_DEFAULT_MODEL

class ProvenanceOptions(BaseModel):
    include_provenance: bool = True  # default ON (reliability is the point)
    max_sources_per_field: int = 10
```
**Dropped from spec (no-ops under MVP):** `OCROptions.computer_vision_scaling_factor`, `include_page_tags` (always on), `GenAIOptions.use_vision`/`vision_scaling_factor`/`vision_detail`/`reasoning_effort`, `ProvenanceOptions.granularity`/`include_bounding_boxes`/`source_type`/`min_confidence`, `RequestIX.version`.
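To make the shapes above concrete, here is a hypothetical `POST /jobs` payload for the mammon → ix path, assuming the Paperless download URL needs Token auth (the URL, request id, and redacted token are illustrative, not from the implementation):

```python
import json

# Hypothetical submission body matching RequestIX above; the FileRef form
# carries the Authorization header Paperless requires for downloads.
payload = {
    "use_case": "bank_statement_header",
    "ix_client_id": "mammon",
    "request_id": "import-4711",          # caller's correlation id (made up here)
    "context": {
        "files": [
            {
                "url": "https://paperless.lan/api/documents/123/download/",
                "headers": {"Authorization": "Token <redacted>"},
            }
        ],
        "texts": ["(Paperless OCR content would go here)"],
    },
    "options": {"provenance": {"include_provenance": True}},
    "callback_url": None,                 # caller will poll GET /jobs/{id}
}

body = json.dumps(payload)  # what would be POSTed to /jobs
```

Note `ix_id` is absent: per the contract, the transport assigns it.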
### ResponseIX
Identical to spec §2.2 except `FieldProvenance` gains two fields:
```python
class FieldProvenance(BaseModel):
    field_name: str
    field_path: str
    value: Any
    sources: list[ExtractionSource]
    confidence: Optional[float] = None  # reserved; always None in MVP
    provenance_verified: bool           # NEW: cited segment actually contains value (normalized)
    text_agreement: Optional[bool]      # NEW: value appears in RequestIX.context.texts; None if no texts
```
`quality_metrics` gains two counters: `verified_fields`, `text_agreement_fields`.
### Job envelope (in `ix_jobs` table; returned by REST)
```python
class Job(BaseModel):
    job_id: UUID
    ix_id: str
    client_id: str
    request_id: str
    status: Literal["pending", "running", "done", "error"]
    request: RequestIX
    response: Optional[ResponseIX]
    callback_url: Optional[str]
    callback_status: Optional[Literal["pending", "delivered", "failed"]]
    attempts: int = 0
    created_at: datetime
    started_at: Optional[datetime]
    finished_at: Optional[datetime]
```
### Job lifecycle state machine
```
POST /jobs (or INSERT+NOTIFY)
        │
        ▼
   ┌─────────┐  worker claims  ┌─────────┐  pipeline returns  ┌──────┐
   │ pending │ ──────────────▶ │ running │ ─────────────────▶ │ done │
   └─────────┘                 └────┬────┘  (response.error   └──────┘
        ▲                           │         is None)
        │                           │
        │                           │ pipeline raised /
        │                           │ response_ix.error set
        │                           ▼
        │                       ┌───────┐
        │                       │ error │
        │                       └───────┘
        │
        │ worker startup sweep: rows with status='running' AND
        │ started_at < now() - 2 × IX_PIPELINE_REQUEST_TIMEOUT_SECONDS
        └─ are reset to 'pending' and attempts++
```
- `status='done'` iff `Job.response.error is None`. Any non-None `error` in the response → `status='error'`. Both terminal states are stable; nothing moves out of them.
- Worker startup sweep protects against "row stuck in `running`" after a crash mid-job. Orphan detection is time-based (2× the per-job timeout), so a still-running worker never reclaims its own job.
- After terminal state, if `callback_url` is set, the worker makes one HTTP POST attempt and records `callback_status` (never changes `status`). Callback failure does not undo the terminal state.
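The time-based orphan test can be sketched as a pure predicate (a minimal sketch of the rule stated above, not the implementation's code; `is_orphan` is a hypothetical name):

```python
from datetime import datetime, timedelta, timezone

IX_PIPELINE_REQUEST_TIMEOUT_SECONDS = 2700  # per-job timeout default from §9

def is_orphan(status: str, started_at: datetime, now: datetime) -> bool:
    """Startup-sweep rule: a 'running' row is reclaimed only after 2x the
    per-job timeout, so a still-running worker never reclaims its own job."""
    cutoff = now - timedelta(seconds=2 * IX_PIPELINE_REQUEST_TIMEOUT_SECONDS)
    return status == "running" and started_at < cutoff

now = datetime(2026, 4, 18, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=600)                 # job still inside its budget
stale = now - timedelta(seconds=2 * 2700 + 60)       # crashed mid-job long ago
print(is_orphan("running", stale, now), is_orphan("running", fresh, now))
# prints: True False
```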
## 4. Job store
```sql
CREATE DATABASE infoxtractor; -- on the existing postgis container
CREATE TABLE ix_jobs (
    job_id          UUID PRIMARY KEY,
    ix_id           TEXT NOT NULL,
    client_id       TEXT NOT NULL,
    request_id      TEXT NOT NULL,
    status          TEXT NOT NULL,
    request         JSONB NOT NULL,
    response        JSONB,
    callback_url    TEXT,
    callback_status TEXT,
    attempts        INT NOT NULL DEFAULT 0,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    started_at      TIMESTAMPTZ,
    finished_at     TIMESTAMPTZ
);
CREATE INDEX ix_jobs_status_created ON ix_jobs (status, created_at) WHERE status = 'pending';
CREATE UNIQUE INDEX ix_jobs_client_request ON ix_jobs (client_id, request_id);
-- Postgres NOTIFY channel used by the pg_queue_adapter: 'ix_jobs_new'
```
Callers that prefer direct SQL (the `pg_queue_adapter` contract): insert a row with `status='pending'`, then `NOTIFY ix_jobs_new, '<job_id>'`. The worker also falls back to a 10 s poll so a missed notify or ix restart doesn't strand a job.
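On the worker side, the `FOR UPDATE SKIP LOCKED` claim mentioned in the architecture diagram could look like the following (an assumed query illustrating the standard Postgres skip-locked pattern, not lifted from the implementation):

```python
# Sketch of the pg_queue_adapter's claim step: atomically flip the oldest
# pending row to 'running', skipping rows another worker has locked.
CLAIM_SQL = """
UPDATE ix_jobs
SET status = 'running', started_at = now()
WHERE job_id = (
    SELECT job_id FROM ix_jobs
    WHERE status = 'pending'
    ORDER BY created_at
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING job_id;
""".strip()
```

The subquery-with-`RETURNING` shape means the claim is a single round trip; with `IX_PIPELINE_WORKER_CONCURRENCY=1` the skip-locked clause is idle insurance, but it is what makes the later multi-container deferral purely additive.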
## 5. REST surface
| Method | Path | Purpose |
|---|---|---|
| `POST` | `/jobs` | Body = `RequestIX` (+ optional `callback_url`). → `201 {job_id, ix_id, status: "pending"}`. Idempotent on `(ix_client_id, request_id)` — same pair returns the existing `job_id` with `200`. |
| `GET` | `/jobs/{job_id}` | → full `Job`. Source of truth regardless of submission path or callback outcome. |
| `GET` | `/jobs?client_id=…&request_id=…` | Lookup-by-correlation (caller idempotency helper). The pair is UNIQUE in the table → at most one match. Returns the job or `404`. |
| `GET` | `/healthz` | `{postgres, ollama, ocr}`. See below for semantics. Used by `infrastructure` monitoring dashboard. |
| `GET` | `/metrics` | Counters over the last 24 hours: `jobs_pending`, `jobs_running`, `jobs_done_24h`, `jobs_error_24h`, per-use-case avg seconds over the same window. Plain JSON, no Prometheus format for MVP. |
**`/healthz` semantics:**
- `postgres`: `SELECT 1` on the job store pool; `ok` iff the query returns within 2 s.
- `ollama`: `GET {IX_OLLAMA_URL}/api/tags` within 5 s; `ok` iff reachable AND the default model (`IX_DEFAULT_MODEL`) is listed in the tags response; `degraded` iff reachable but the model is missing (ops action: run `ollama pull <model>` on the host); `fail` on any other error.
- `ocr`: `SuryaOCRClient.selfcheck()` — returns `ok` iff CUDA is available and the Surya text-recognition model is loaded into GPU memory at process start. `fail` on any error.
- Overall HTTP status: `200` iff all three are `ok`; `503` otherwise. The monitoring dashboard only surfaces `200`/`non-200`.
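The overall-status rule can be written as a one-line aggregation (a sketch of the semantics above; the function name is hypothetical):

```python
def overall_status(postgres: str, ollama: str, ocr: str) -> int:
    """/healthz aggregation: 200 iff all three checks are 'ok'.
    Note a 'degraded' ollama (model missing) still yields 503."""
    return 200 if (postgres, ollama, ocr) == ("ok", "ok", "ok") else 503

print(overall_status("ok", "ok", "ok"), overall_status("ok", "degraded", "ok"))
# prints: 200 503
```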
**Callback delivery** (when `callback_url` is set): one POST of the full `Job` body, 10 s timeout. 2xx → `callback_status='delivered'`. Anything else → `'failed'`. No retry. Callers always have `GET /jobs/{id}` as the authoritative fallback.
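The outcome mapping for the one-shot callback is small enough to state as code (a sketch of the rule above; `callback_status_for` is a hypothetical helper, with `None` standing in for a connect error or timeout):

```python
from typing import Optional

def callback_status_for(http_status: Optional[int]) -> str:
    """One-shot webhook outcome: any 2xx is 'delivered'; everything else
    (non-2xx, or None for timeout/connection failure) is 'failed'. No retry —
    GET /jobs/{id} remains the authoritative fallback."""
    if http_status is not None and 200 <= http_status < 300:
        return "delivered"
    return "failed"

print(callback_status_for(204), callback_status_for(500), callback_status_for(None))
# prints: delivered failed failed
```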
## 6. Pipeline steps
Interface per spec §3 (`async validate` + `async process`). Pipeline orchestration per spec §4 (first error aborts; each step wrapped in a `Timer` landing in `Metadata.timings`).
### SetupStep
- **validate**: `request_ix` non-null; `context.files` or `context.texts` non-empty.
- **process**:
  - Copy `request_ix.context.texts` → `response_ix.context.texts`.
  - Normalize each `context.files` entry: plain `str` → `FileRef(url=str, headers={})`. `file://` URLs are read locally; `http(s)://` URLs are downloaded with the per-file `headers`.
  - Download files to `/tmp/ix/<ix_id>/` in parallel (asyncio + httpx). Per-file: connect timeout 10 s, read timeout 30 s, size cap `min(FileRef.max_bytes, IX_FILE_MAX_BYTES)` (default 50 MB). Any fetch failure (non-2xx, timeout, size exceeded) → `IX_000_007` with the offending URL and cause in the message. No retry.
  - MIME detection via `python-magic` on the downloaded bytes (do not trust URL extension). Supported: PDF (`application/pdf`), PNG (`image/png`), JPEG (`image/jpeg`), TIFF (`image/tiff`). Unsupported → `IX_000_005`.
  - Load use case: `entry = REGISTRY.get(request_ix.use_case)`; if `None` → `IX_001_001`. Store `(use_case_request, use_case_response)` instances in `response_ix.context`. Echo `use_case_request.use_case_name` → `response_ix.use_case_name`.
  - Build flat `response_ix.context.pages`: one entry per PDF page (via PyMuPDF), one per image frame, one per text entry. Hard cap 100 pages/PDF → `IX_000_006` on violation.
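The size-cap resolution deserves a line of code, since the per-file override can only tighten the global default (a sketch of the `min(...)` rule above; the function name is hypothetical):

```python
from typing import Optional

IX_FILE_MAX_BYTES = 52_428_800  # 50 MB default from §9

def effective_size_cap(per_file_max: Optional[int]) -> int:
    """Per-file download cap used by SetupStep: a FileRef.max_bytes override
    can lower the limit but never exceed IX_FILE_MAX_BYTES."""
    if per_file_max is None:
        return IX_FILE_MAX_BYTES
    return min(per_file_max, IX_FILE_MAX_BYTES)

print(effective_size_cap(None), effective_size_cap(10_000_000), effective_size_cap(10**9))
# prints: 52428800 10000000 52428800
```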
### OCRStep
- **validate**:
  - If `(include_geometries or include_ocr_text or ocr_only) and not context.files` → raise `IX_000_004` (the caller asked for OCR artifacts but gave nothing to OCR).
  - Else return `True` iff `(use_ocr or include_geometries or include_ocr_text or ocr_only) and context.files`. Otherwise `False` → step skipped (text-only requests).
  - If `use_ocr=False` but any of `include_geometries`/`include_ocr_text`/`ocr_only` is set, OCR still runs — the flag triad controls what is *retained*, not whether OCR happens.
- **process**: `ocr_result = await OCRClient.ocr(context.pages)` → `response_ix.ocr_result`. Always inject `<page file="{item_index}" number="{page_no}">` tags (simplifies grounding). If `include_provenance`: build `SegmentIndex` (line granularity, normalized bboxes 0-1) and store in `context.segment_index`.
- **OCRClient interface**:
```python
class OCRClient(Protocol):
    async def ocr(self, pages: list[Page]) -> OCRResult: ...
```
MVP implementation: `SuryaOCRClient` (GPU via `surya-ocr` PyPI package, CUDA on the RTX 3090).
### GenAIStep
- **validate**: if `ocr_only` → return `False` (step skipped). Use case must exist. OCR text or `context.texts` must be non-empty (else `IX_001_000`).
- **process**:
  - System prompt = `use_case_request.system_prompt`. If `include_provenance`: append spec §9.2 citation instruction verbatim.
  - User text: segment-tagged (`[p1_l0] …`) when provenance is on; plain concatenated OCR + texts otherwise.
  - Response schema: `UseCaseResponse` directly, or the dynamic `ProvenanceWrappedResponse(result=..., segment_citations=...)` per spec §7.2e when provenance is on.
  - Model: `request_ix.options.gen_ai.gen_ai_model_name` → `use_case_request.default_model` → `IX_DEFAULT_MODEL`.
  - Call `GenAIClient.invoke(request_kwargs, response_schema)`; parsed model → `ix_result.result`, usage + model_name → `ix_result.meta_data`.
  - If provenance: call `ProvenanceUtils.map_segment_refs_to_provenance(...)` per spec §9.4, write `response_ix.provenance`.
- **GenAIClient interface**:
```python
class GenAIClient(Protocol):
    async def invoke(self, request_kwargs: dict, response_schema: type[BaseModel]) -> GenAIInvocationResult: ...
```
MVP implementation: `OllamaClient` → `POST {IX_OLLAMA_URL}/api/chat` with `format = <JSON schema from Pydantic>` (Ollama structured outputs). Per-call timeout: `IX_GENAI_CALL_TIMEOUT_SECONDS` (default 1500 s, distinct from the per-job timeout so a frozen model doesn't eat the full 45-minute budget).
- **Failure modes (no retry on MVP, both surface as pipeline error):**
- Connection refused / timeout / 5xx → `IX_002_000` ("inference backend unavailable") with model name + endpoint.
- 2xx response body cannot be parsed against the Pydantic schema (Ollama structured output violated the schema) → `IX_002_001` ("structured output parse failed") with a snippet of the offending body.
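The `IX_002_001` path can be illustrated with a stripped-down parser (a minimal sketch: the real step validates against the full Pydantic schema, while this stand-in only checks for JSON-ness and required keys; the function name and tuple return shape are assumptions):

```python
import json
from typing import Optional

def parse_structured_output(
    body: str, required: set[str]
) -> tuple[Optional[str], Optional[dict]]:
    """Map a 2xx Ollama body to either (None, parsed) or ('IX_002_001', None)
    when the body is not JSON or misses schema-required keys."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return "IX_002_001", None          # not JSON at all
    if not required <= set(data):
        return "IX_002_001", None          # schema violated despite format=
    return None, data

print(parse_structured_output('{"bank_name": "Test Bank"}', {"bank_name"}))
print(parse_structured_output('not json', {"bank_name"}))
# prints: (None, {'bank_name': 'Test Bank'}) then ('IX_002_001', None)
```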
### ReliabilityStep (new; runs when `include_provenance` is True)
For each `FieldProvenance` in `response_ix.provenance.fields`:
- **`provenance_verified`**: for each cited segment, compare `text_snippet` to the extracted `value` after normalization (see below). If any cited segment agrees → `True`. Else `False`.
- **`text_agreement`**: if `request_ix.context.texts` is empty → `None`. Else run the same comparison against the concatenated texts → `True` / `False`.
**Per-field-type dispatch** (picks the comparator based on the Pydantic field annotation on the use-case response schema):
| Python type annotation | Comparator |
|---|---|
| `str` | String normalizer (NFKC, casefold, collapse whitespace, strip common punctuation); substring check |
| `int`, `float`, `Decimal` | Digits-and-sign only (strip currency symbols, thousands separators, decimal variants); exact match at 2 decimal places |
| `date`, `datetime` | Parse *both* sides with `dateutil.parser.parse(..., dayfirst=True)`; compare as ISO strings |
| IBAN (str with `account_iban`-like names) | Upper-case, strip whitespace; exact match |
| `Literal[...]` | **Skipped** — verification is `None` (caller-controlled enum labels rarely appear verbatim in the source text). `text_agreement` also `None`. |
| `None` / unset value | **Skipped** → `provenance_verified = None`, `text_agreement = None`. Field still appears in provenance output. |
**Short-value skip rule** (applies after comparator selection): if the stringified `value` is ≤ 2 chars, or a numeric value has `|value| < 10`, `text_agreement` is skipped (→ `None`). `provenance_verified` still runs — the bbox-anchored cite is stronger than a global text scan for short values.
The step only mutates the provenance structure; it does **not** drop fields. The caller sees every extracted field plus the flags.
Writes `quality_metrics.verified_fields` (count where `provenance_verified=True`) and `quality_metrics.text_agreement_fields` (count where `text_agreement=True`) summary counters; fields with `None` flags are not counted as either success or failure.
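A sketch of the string comparator and the `provenance_verified` check described above (helper names are hypothetical; the punctuation set is an assumption about what "common punctuation" covers):

```python
import re
import unicodedata

def normalize(s: str) -> str:
    """String comparator from the dispatch table: NFKC, casefold,
    strip common punctuation, collapse whitespace."""
    s = unicodedata.normalize("NFKC", s).casefold()
    s = re.sub(r"[.,;:!?'\"()]", "", s)      # assumed punctuation set
    return re.sub(r"\s+", " ", s).strip()

def provenance_verified(value: str, snippets: list[str]) -> bool:
    """True iff any cited segment's text contains the normalized value."""
    needle = normalize(value)
    return any(needle in normalize(snippet) for snippet in snippets)

print(provenance_verified("Deutsche Bank", ["DEUTSCHE  BANK AG, Frankfurt"]),
      provenance_verified("Deutsche Bank", ["Commerzbank"]))
# prints: True False
```

`text_agreement` would run the same comparison against the concatenated `context.texts` instead of the cited snippets.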
### ResponseHandlerStep
Per spec §8, unchanged. Attach flat OCR text when `include_ocr_text`; strip `ocr_result.pages` unless `include_geometries`; delete `context` before serialization.
## 7. Use case registry
```
ix/use_cases/
__init__.py # REGISTRY: dict[str, tuple[type[UseCaseRequest], type[UseCaseResponse]]]
bank_statement_header.py
```
Adding a use case = new module exporting a `Request(BaseModel)` (`use_case_name`, `default_model`, `system_prompt`) and a `UseCaseResponse(BaseModel)`, then one `REGISTRY["<name>"] = (Request, UseCaseResponse)` line.
### First use case: `bank_statement_header`
```python
class BankStatementHeader(BaseModel):
    bank_name: str
    account_iban: Optional[str]
    account_type: Optional[Literal["checking", "credit", "savings"]]
    currency: str
    statement_date: Optional[date]
    statement_period_start: Optional[date]
    statement_period_end: Optional[date]
    opening_balance: Optional[Decimal]
    closing_balance: Optional[Decimal]

class Request(BaseModel):
    use_case_name: str = "Bank Statement Header"
    default_model: str = "gpt-oss:20b"
    system_prompt: str = (
        "You extract header metadata from a single bank or credit-card statement. "
        "Return only facts that appear in the document; leave a field null if uncertain. "
        "Balances must use the document's numeric format (e.g. '1234.56' or '-123.45'); "
        "do not invent a currency symbol. Account type: 'checking' for current/Giro accounts, "
        "'credit' for credit-card statements, 'savings' otherwise. Always return the IBAN "
        "with spaces removed. Never fabricate a value to fill a required-looking field."
    )
```
**Why these fields:** each appears at most once per document (one cite per field → strong `provenance_verified` signal); all reconcile against something mammon already stores (IBAN → `Account.iban`, period → verified-range chain, closing_balance → next month's opening_balance and `StatementBalance`); schema is flat (no nested arrays where Ollama structured output tends to drift).
## 8. Errors and warnings
Error-code set (spec §12.2 subset + MVP-specific codes for on-prem failure modes):
| Code | Trigger |
|---|---|
| `IX_000_000` | `request_ix` is None |
| `IX_000_002` | No context (neither files nor texts) |
| `IX_000_004` | `include_geometries`, `include_ocr_text`, or `ocr_only` set but `context.files` empty |
| `IX_000_005` | File MIME type not supported (after byte-sniffing) |
| `IX_000_006` | PDF page-count cap exceeded |
| `IX_000_007` | File fetch failed (connect / timeout / non-2xx / size cap exceeded) |
| `IX_001_000` | Extraction context empty after setup (OCR produced nothing AND `context.texts` empty) |
| `IX_001_001` | Use case name not in `REGISTRY` |
| `IX_002_000` | Inference backend unavailable (Ollama connect / timeout / 5xx) |
| `IX_002_001` | Structured output parse failed (Ollama response body didn't match schema) |
Warnings (non-fatal, appended to `response_ix.warning`): empty OCR result, provenance requested with `use_ocr=False`, requested model unavailable and falling back to `IX_DEFAULT_MODEL`, very short or Literal-typed field skipped during reliability check.
## 9. Configuration (`AppConfig` via `pydantic-settings`)
| Key env var | Default | Meaning |
|---|---|---|
| `IX_POSTGRES_URL` | `postgresql+asyncpg://infoxtractor:<password>@host.docker.internal:5431/infoxtractor` | Job store. Password must be set in `.env`; `.env.example` ships with `<password>` as a placeholder. |
| `IX_OLLAMA_URL` | `http://host.docker.internal:11434` | LLM backend |
| `IX_DEFAULT_MODEL` | `gpt-oss:20b` | Fallback model |
| `IX_OCR_ENGINE` | `surya` | Adapter selector (only value in MVP) |
| `IX_TMP_DIR` | `/tmp/ix` | Download scratch |
| `IX_PIPELINE_WORKER_CONCURRENCY` | `1` | Worker semaphore cap |
| `IX_PIPELINE_REQUEST_TIMEOUT_SECONDS` | `2700` | Per-job timeout (45 min) |
| `IX_GENAI_CALL_TIMEOUT_SECONDS` | `1500` | Per-LLM-call timeout (distinct from per-job) |
| `IX_FILE_MAX_BYTES` | `52428800` | Default per-file download size cap (50 MB) |
| `IX_FILE_CONNECT_TIMEOUT_SECONDS` | `10` | Per-file connect timeout |
| `IX_FILE_READ_TIMEOUT_SECONDS` | `30` | Per-file read timeout |
| `IX_RENDER_MAX_PIXELS_PER_PAGE` | `75000000` | Per-page render cap |
| `IX_LOG_LEVEL` | `INFO` | |
| `IX_CALLBACK_TIMEOUT_SECONDS` | `10` | Webhook POST timeout |
No Azure, OpenAI, or AWS variables — those paths do not exist in the codebase.
## 10. Observability (minimal)
- **Logs**: JSON-structured via `logging` + custom formatter. Every line carries `ix_id`, `client_id`, `request_id`, `use_case`. Steps emit `step_start` / `step_end` events with elapsed ms.
- **Timings**: every step's elapsed-seconds recorded in `response_ix.metadata.timings` (same shape as spec §2).
- **Traces**: OpenTelemetry span scaffolding present, no exporter wired. Drop-in later.
- **Prometheus**: deferred.
## 11. Deployment
- Repo: `goldstein/infoxtractor` on Forgejo, plus `server` bare-repo remote with `post-receive` hook mirroring mammon.
- Port 8994 (LAN-only via UFW; not exposed publicly — internal service). No `infrastructure.docs_url` label, no VPS Caddy entry.
- Postgres: new `infoxtractor` database on existing postgis container.
- Ollama reached via `host.docker.internal:11434`.
- Monitoring label: `infrastructure.web_url=http://192.168.68.42:8994`.
- Backup: `backup.enable=true`, `backup.type=postgres`, `backup.name=infoxtractor`.
- Dockerfile: CUDA-enabled base (`nvidia/cuda:12.4-runtime-ubuntu22.04` + Python 3.12) so Surya can use the 3090. CMD: `alembic upgrade head && uvicorn ix.app:create_app --factory --host 0.0.0.0 --port 8994`.
- Docker Compose gives the container GPU access: `runtime: nvidia` + a `deploy.resources.reservations` GPU entry (same shape as Immich ML / monitoring). The `docker run` equivalent used by post-receive hooks must include `--gpus all`.
- **Pre-deploy check:** the host must have `gpt-oss:20b` pulled into Ollama before first deploy (`ollama pull gpt-oss:20b`). If the model is missing at startup, `/healthz` returns `503` with `ollama: "degraded"` and the monitoring dashboard surfaces the failure. The `post-receive` hook probes `/healthz` for 60 s after container restart; a `503` that doesn't resolve fails the deploy.
## 12. Testing strategy
Strict TDD — each unit is written test-first.
1. **Unit tests** (fast, hermetic): every `Step`, `SegmentIndex`, provenance-verification normalizers, `OCRClient` contract, `GenAIClient` contract, error mapping. No DB, no Ollama, no network.
2. **Integration tests** (DB + fakes): pipeline end-to-end with stub `OCRClient` (replays canned OCR results) and stub `GenAIClient` (replays canned LLM JSON). Covers step wiring + transports + job lifecycle + callback success/failure + pg queue notify + worker startup orphan-sweep. Run against a real postgres service container in Forgejo Actions (mammon CI pattern). **Forgejo Actions runners have neither GPU nor network access to the LAN Ollama/Surya instances; all inference in CI is stubbed.** Real-Ollama tests are gated behind `IX_TEST_OLLAMA=1` and run only from the Mac.
3. **E2E smoke against deployed app**: `scripts/e2e_smoke.py` on the Mac calls `POST http://192.168.68.42:8994/jobs` with a **synthetic** bank-statement fixture (`tests/fixtures/synthetic_giro.pdf` — generated from a template, no real personal data; a separate redacted-real-statement fixture lives outside git at `~/ix-fixtures/` if needed), polls `GET /jobs/{id}` until done, asserts:
- `status == "done"`
- `provenance.fields["result.closing_balance"].provenance_verified is True`
- `text_agreement is True` when Paperless-style texts are submitted
- Timings under 60 s
Runs after every `git push server main` as the deploy gate. **Deploy-failure workflow:** if the smoke test fails, `git revert HEAD` creates a forward-commit that undoes the change, then push that revert commit to both `forgejo` and `server`. Never force-push to `main`; never rewrite history on deployed commits.
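The poll-until-terminal loop the smoke test relies on can be sketched as follows (a minimal sketch: `poll_job` and `get_job` are hypothetical names, with `get_job` standing in for an HTTP GET of `/jobs/{id}`):

```python
import time
from typing import Callable

def poll_job(get_job: Callable[[str], dict], job_id: str,
             timeout_s: float = 60.0, interval_s: float = 2.0) -> dict:
    """Poll until the job reaches a terminal status ('done' or 'error')
    or the smoke test's timing budget is exhausted."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        job = get_job(job_id)
        if job["status"] in ("done", "error"):
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} not terminal within {timeout_s}s")

# Stub that reaches 'done' on the third poll:
calls = {"n": 0}
def fake_get(job_id: str) -> dict:
    calls["n"] += 1
    return {"status": "done" if calls["n"] >= 3 else "running"}

print(poll_job(fake_get, "abc", timeout_s=5, interval_s=0.01)["status"])
# prints: done
```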
## 13. Mammon integration (sketch — outside this spec's scope, owned by mammon)
Belongs in a mammon-side follow-up spec. Captured here only so readers of ix know the MVP's first consumer.
- Paperless poller keeps current behavior for format-matched docs.
- For `needs_parser` docs: submit to ix (`use_case="bank_statement_header"`, `files=[paperless_download_url]`, `texts=[paperless_content]`).
- ix job id recorded on the `Import` row. A new poller on the mammon side checks `GET /jobs/{id}` until done.
- Result is staged (new `pending_headers` table — not `StatementBalance`). A new "Investigate" panel surfaces staged headers with per-field `provenance_verified` + `text_agreement` flags.
- User confirms → write to `StatementBalance`. Over time, when a deterministic parser is added for the bank, compare ix's past extractions against the deterministic output to measure ix accuracy.
## 14. Deferred from full spec (explicit)
- Kafka transport (§15)
- Config Server (§9.1 in full spec, §10 here): use cases are in-repo for MVP
- `use_case` as URL or Base64-encoded definition (MVP accepts only registered-name strings)
- Azure DI / Computer Vision OCR backends
- OpenAI, Anthropic, AWS Bedrock GenAI backends
- S3 adapter
- `use_vision` + vision scaling/detail
- Word-level provenance granularity
- `reasoning_effort` parameter routing
- Prometheus exporter (/metrics stays JSON for MVP)
- OTEL gRPC exporter (spans present, no exporter)
- Legacy aliases (`prompt_template_base`, `kwargs_use_case`)
- Second-opinion multi-model ensembling
- Schema `version` field
- Per-request rate limiting
- Callback retries (one-shot only for MVP; callers poll as fallback)
- Multi-container workers (single worker in MVP; the `FOR UPDATE SKIP LOCKED` claim pattern is ready for horizontal scale when needed)
The `quality_metrics` shape retains the reference-spec counters (`fields_with_provenance`, `total_fields`, `coverage_rate`, `invalid_references`) and adds the two MVP counters (`verified_fields`, `text_agreement_fields`).
Every deferred item is additive: the `OCRClient` / `GenAIClient` / transport-adapter interfaces already leave the plug points, and the pipeline core is unaware of which implementation is in use.
## 15. Implementation workflow (habit reminder)
Every unit of work follows the cross-project habit:
1. `git checkout -b feat/<name>`
2. TDD: write failing test, write code, green, refactor
3. Commit in small logical chunks; update `AGENTS.md` / `README.md` / `docs/` in the same commit as the code
4. `git push forgejo feat/<name>`
5. Create PR via Forgejo API
6. Wait for tests to pass
7. Merge
8. `git push server main` to deploy; run `scripts/e2e_smoke.py` against the live service
Never skip hooks, never force-push main, never bypass tests.