# InfoXtractor (ix)
Async, on-prem, LLM-powered structured information extraction microservice. Given a document (PDF, image) or text plus a use case (a Pydantic schema), it returns a structured JSON result with per-field provenance (source page, bounding box, OCR segment).

Designed to be used by other on-prem services (e.g. mammon) as a reliable fallback / second opinion for format-specific deterministic parsers.

Status: MVP deployed (2026-04-18) at `http://192.168.68.42:8994` — LAN only. Browser UI at `http://192.168.68.42:8994/ui`. Full reference spec at `docs/spec-core-pipeline.md`; MVP spec at `docs/superpowers/specs/2026-04-18-ix-mvp-design.md`; deploy runbook at `docs/deployment.md`.

Use cases: the built-in registry lives in `src/ix/use_cases/__init__.py` (`bank_statement_header` for the MVP). Callers without a registered entry can ship an ad-hoc schema inline via `RequestIX.use_case_inline` (see README "Ad-hoc use cases"); the pipeline builds the Pydantic classes on the fly per request. The `/ui` page exposes this as a "custom" option so non-engineering users can experiment without a deploy.

UX notes: the `/ui` job page surfaces queue position and elapsed MM:SS on each poll, renders the client-provided filename (stored via `FileRef.display_name`, optional metadata — the pipeline ignores it for execution), and shows a CPU-mode notice when `/healthz` reports `ocr_gpu: false`.
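As a sketch of the inline use-case mechanism, the pipeline's on-the-fly class building can be pictured with `pydantic.create_model`. The field names and the inline payload shape below are illustrative assumptions, not the actual wire format:

```python
# Sketch only: build a use-case schema dynamically, as the pipeline does for
# RequestIX.use_case_inline. Field names are made up for illustration.
from pydantic import create_model

# An ad-hoc use case arriving as plain data (assumed shape, not the wire format).
inline_fields = {
    "iban": (str, ...),             # required string field
    "statement_date": (str, ...),   # required string field
    "balance": (float, None),       # optional, defaults to None
}

BankHeader = create_model("BankHeader", **inline_fields)

doc = BankHeader(iban="DE89370400440532013000", statement_date="2026-03-31")
print(doc.model_dump())
```

The resulting class behaves like any registered use-case schema, so the rest of the pipeline does not need to know whether the schema came from the registry or from the request.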
## Guiding Principles
- **On-prem always.** All LLM inference, OCR, and user-data processing run on the home server (192.168.68.42). No cloud APIs — OpenAI, Anthropic, Azure, AWS Bedrock/Textract, Google Document AI, Mistral, etc. are not to be used for user data or inference. LLM backend is Ollama (:11434); OCR runs locally (pluggable `OCRClient` interface, first engine: Surya on the RTX 3090); job state lives in local Postgres on the postgis container. The spec's references to Azure / AWS / OpenAI are examples to *replace*, not inherit.
- **Grounded extraction, not DB truth.** ix returns best-effort extracted fields with segment citations, provenance, and cross-OCR agreement signals. ix does *not* claim its output is DB-grade; the calling service (e.g. mammon) owns the reliability decision (reconcile against anchors, stage for review, compare to deterministic parsers).
- **Transport-agnostic pipeline core.** The pipeline (`RequestIX` → `ResponseIX`) knows nothing about HTTP, queues, or databases. Transport adapters (REST, Postgres queue, …) run in parallel alongside the core and all converge on one job store.
## Habits
- **Feature branches + PRs.** New work: `git checkout -b feat/<name>` → commit small, logical chunks → `git push forgejo feat/<name>` → create PR via Forgejo API → **wait for tests to pass** → merge → `git push server main` to deploy.
- **Keep documentation up to date in the same commit as the code.** `README.md`, `docs/`, and `AGENTS.md` update alongside the change. Unpushed / undocumented work is work that isn't done.
- **Deploy after merging.** `git push server main` rebuilds the Docker image via `post-receive` and restarts the container. Smoke-test the live service before walking away.
- **Never skip hooks** (`--no-verify`, etc.) without explicit user approval. Prefer creating new commits over amending. Never force-push `main`.
- **Forgejo**: repo at `http://192.168.68.42:3030/goldstein/infoxtractor` (to be created). Use basic auth with `FORGEJO_USR` / `FORGEJO_PSD` from `~/Projects/infrastructure/.env`, or an API token once issued for this repo.
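The "create PR via Forgejo API" step can be scripted; this sketch only builds the request and does not send it. The endpoint follows the Gitea-compatible API (`POST /api/v1/repos/{owner}/{repo}/pulls`); reading credentials from `FORGEJO_USR` / `FORGEJO_PSD` matches the convention above:

```python
# Sketch: prepare a PR-creation request for the Forgejo (Gitea-compatible)
# REST API using basic auth. Sending is left to the caller.
import base64
import json
import os
from urllib.request import Request


def build_pr_request(branch: str, title: str) -> Request:
    usr = os.environ.get("FORGEJO_USR", "user")
    psd = os.environ.get("FORGEJO_PSD", "pass")
    token = base64.b64encode(f"{usr}:{psd}".encode()).decode()
    body = json.dumps({"title": title, "head": branch, "base": "main"}).encode()
    return Request(
        "http://192.168.68.42:3030/api/v1/repos/goldstein/infoxtractor/pulls",
        data=body,
        headers={"Authorization": f"Basic {token}", "Content-Type": "application/json"},
        method="POST",
    )


req = build_pr_request("feat/example", "feat: example")
# urllib.request.urlopen(req) would submit it; omitted here.
```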
## Tech Stack (MVP)
- **Language**: Python 3.12, asyncio
- **Web/REST**: FastAPI + uvicorn
- **OCR (pluggable)**: Surya OCR first (GPU, shares RTX 3090 with Ollama / Immich ML)
- **LLM**: Ollama at `192.168.68.42:11434`, structured outputs via JSON schema. Initial model candidates: `qwen2.5:32b` / `qwen3:14b`, configurable per use case
- **State**: Postgres on the shared `postgis` container (:5431), new `infoxtractor` database
- **Deployment**: Docker, `git push server main` → post-receive rebuild (pattern from other apps)
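Ollama's structured outputs work by passing a JSON schema in the request's `format` field, which constrains generation to schema-valid JSON. A hedged sketch of the payload (model name and schema fields are placeholders; only the payload is built, nothing is sent):

```python
# Sketch: payload for Ollama's /api/chat with structured output.
# Ollama accepts a JSON schema in "format" and constrains the reply to it.
import json

schema = {
    "type": "object",
    "properties": {"iban": {"type": "string"}, "statement_date": {"type": "string"}},
    "required": ["iban", "statement_date"],
}

payload = json.dumps({
    "model": "qwen2.5:32b",   # one candidate model from the stack above
    "messages": [{"role": "user", "content": "Extract the header fields from: ..."}],
    "format": schema,          # JSON schema -> constrained JSON output
    "stream": False,
})
# POST payload to http://192.168.68.42:11434/api/chat (left to the caller).
```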
## Repository / Deploy
- Git remotes:
  - `forgejo`: `ssh://git@192.168.68.42:2222/goldstein/infoxtractor.git` (source of truth / PRs)
  - `server`: bare repo with `post-receive` rebuild hook (to be set up)
- Workflow: feat branch → `git push forgejo feat/name` → PR via Forgejo API → merge → `git push server main` to deploy
- Monitoring label: `infrastructure.web_url=http://192.168.68.42:<PORT>`
- Backup opt-in: `backup.enable=true` label on the container
## Related Projects
- **mammon** (`../mammon`) — first consumer. Uses ix as a fallback / second opinion for Paperless-imported bank statements where deterministic parsers don't match.
- **infrastructure** (`../infrastructure`) — server topology, deployment pattern, Ollama setup, shared `postgis` Postgres.
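As an illustration of how a consumer like mammon might drive ix's async job API, here is a submit-then-poll loop. The `/jobs` endpoint paths and status values are assumptions for illustration, not the documented API; the status fetcher is injected so the loop is testable without a live server:

```python
# Sketch: client-side poll loop for an async job API.
# Endpoint paths and status names are illustrative assumptions.
from typing import Callable


def poll_until_done(fetch_status: Callable[[str], dict], job_id: str,
                    max_polls: int = 100) -> dict:
    """Poll until the job leaves the queue. fetch_status is injected; a real
    client would wrap e.g. GET http://192.168.68.42:8994/jobs/{id} and sleep
    between polls (the /ui page renders queue position + elapsed the same way).
    """
    for _ in range(max_polls):
        job = fetch_status(job_id)
        if job["status"] in ("done", "failed"):
            return job
    raise TimeoutError(f"job {job_id} still pending after {max_polls} polls")
```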