# InfoXtractor (ix) Async, on-prem, LLM-powered structured information extraction microservice. Given a document (PDF, image) or text plus a use case (a Pydantic schema), returns a structured JSON result with per-field provenance (source page, bounding box, OCR segment). Designed to be used by other on-prem services (e.g. mammon) as a reliable fallback / second opinion for format-specific deterministic parsers. Status: MVP deployed (2026-04-18) at `http://192.168.68.42:8994` — LAN only. Full reference spec at `docs/spec-core-pipeline.md`; MVP spec at `docs/superpowers/specs/2026-04-18-ix-mvp-design.md`; deploy runbook at `docs/deployment.md`. ## Guiding Principles - **On-prem always.** All LLM inference, OCR, and user-data processing run on the home server (192.168.68.42). No cloud APIs — OpenAI, Anthropic, Azure, AWS Bedrock/Textract, Google Document AI, Mistral, etc. are not to be used for user data or inference. LLM backend is Ollama (:11434); OCR runs locally (pluggable `OCRClient` interface, first engine: Surya on the RTX 3090); job state lives in local Postgres on the postgis container. The spec's references to Azure / AWS / OpenAI are examples to *replace*, not inherit. - **Grounded extraction, not DB truth.** ix returns best-effort extracted fields with segment citations, provenance, and cross-OCR agreement signals. ix does *not* claim its output is DB-grade; the calling service (e.g. mammon) owns the reliability decision (reconcile against anchors, stage for review, compare to deterministic parsers). - **Transport-agnostic pipeline core.** The pipeline (`RequestIX` → `ResponseIX`) knows nothing about HTTP, queues, or databases. Transport adapters (REST, Postgres queue, …) run in parallel alongside the core and all converge on one job store. ## Habits - **Feature branches + PRs.** New work: `git checkout -b feat/` → commit small, logical chunks → `git push forgejo feat/` → create PR via Forgejo API → **wait for tests to pass** → merge → `git push server main` to deploy. - **Keep documentation up to date in the same commit as the code.** `README.md`, `docs/`, and `AGENTS.md` update alongside the change. Unpushed / undocumented work is work that isn't done. - **Deploy after merging.** `git push server main` rebuilds the Docker image via `post-receive` and restarts the container. Smoke-test the live service before walking away. - **Never skip hooks** (`--no-verify`, etc.) without explicit user approval. Prefer creating new commits over amending. Never force-push `main`. - **Forgejo**: repo at `http://192.168.68.42:3030/goldstein/infoxtractor` (to be created). Use basic auth with `FORGEJO_USR` / `FORGEJO_PSD` from `~/Projects/infrastructure/.env`, or an API token once issued for this repo. ## Tech Stack (MVP) - **Language**: Python 3.12, asyncio - **Web/REST**: FastAPI + uvicorn - **OCR (pluggable)**: Surya OCR first (GPU, shares RTX 3090 with Ollama / Immich ML) - **LLM**: Ollama at `192.168.68.42:11434`, structured outputs via JSON schema. Initial model candidate: `qwen2.5:32b` / `qwen3:14b`, configurable per use case - **State**: Postgres on the shared `postgis` container (:5431), new `infoxtractor` database - **Deployment**: Docker, `git push server main` → post-receive rebuild (pattern from other apps) ## Repository / Deploy - Git remotes: - `forgejo`: `ssh://git@192.168.68.42:2222/goldstein/infoxtractor.git` (source of truth / PRs) - `server`: bare repo with `post-receive` rebuild hook (to be set up) - Workflow: feat branch → `git push forgejo feat/name` → PR via Forgejo API → merge → `git push server main` to deploy - Monitoring label: `infrastructure.web_url=http://192.168.68.42:` - Backup opt-in: `backup.enable=true` label on the container ## Related Projects - **mammon** (`../mammon`) — first consumer. Uses ix as a fallback / second opinion for Paperless-imported bank statements where deterministic parsers don't match. - **infrastructure** (`../infrastructure`) — server topology, deployment pattern, Ollama setup, shared `postgis` Postgres.