InfoXtractor (ix)

Async, on-prem, LLM-powered structured information extraction microservice. Given a document (PDF, image) or raw text plus a use case (a Pydantic schema), it returns a structured JSON result with per-field provenance (source page, bounding box, OCR segment).

Designed to be used by other on-prem services (e.g. mammon) as a reliable fallback / second opinion for format-specific deterministic parsers.

Status: MVP deployed (2026-04-18) at http://192.168.68.42:8994 — LAN only. Browser UI at http://192.168.68.42:8994/ui. Full reference spec at docs/spec-core-pipeline.md; MVP spec at docs/superpowers/specs/2026-04-18-ix-mvp-design.md; deploy runbook at docs/deployment.md.

Use cases: the built-in registry lives in src/ix/use_cases/__init__.py (bank_statement_header for MVP). Callers without a registered entry can ship an ad-hoc schema inline via RequestIX.use_case_inline (see README "Ad-hoc use cases"); the pipeline builds the Pydantic classes on the fly per request. The /ui page exposes this as a "custom" option so non-engineering users can experiment without a deploy.
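As a rough sketch of what an ad-hoc submission might look like, the snippet below builds a `use_case_inline` payload. The inner field names (`name`, `fields`, `type`, `description`) are assumptions for illustration only; the README's "Ad-hoc use cases" section is the authoritative shape.

```python
import json

# Hypothetical ad-hoc use case (not in the registry). The key names inside
# "fields" are illustrative assumptions, not the documented schema.
use_case_inline = {
    "name": "utility_invoice",
    "fields": {
        "invoice_number": {"type": "str", "description": "Printed invoice ID"},
        "total_amount": {"type": "float", "description": "Grand total incl. VAT"},
    },
}

# The pipeline builds the Pydantic classes from this per request; a caller
# just ships it alongside the document in the request body:
request_body = json.dumps({"use_case_inline": use_case_inline})
```

The `/ui` "custom" option produces the same shape from the form's `fields_json` input, so both paths exercise identical pipeline code.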

Guiding Principles

  • On-prem always. All LLM inference, OCR, and user-data processing run on the home server (192.168.68.42). No cloud APIs — OpenAI, Anthropic, Azure, AWS Bedrock/Textract, Google Document AI, Mistral, etc. are not to be used for user data or inference. LLM backend is Ollama (:11434); OCR runs locally (pluggable OCRClient interface, first engine: Surya on the RTX 3090); job state lives in local Postgres on the postgis container. The spec's references to Azure / AWS / OpenAI are examples to replace, not inherit.
  • Grounded extraction, not DB truth. ix returns best-effort extracted fields with segment citations, provenance, and cross-OCR agreement signals. ix does not claim its output is DB-grade; the calling service (e.g. mammon) owns the reliability decision (reconcile against anchors, stage for review, compare to deterministic parsers).
  • Transport-agnostic pipeline core. The pipeline (RequestIX → ResponseIX) knows nothing about HTTP, queues, or databases. Transport adapters (REST, Postgres queue, …) run in parallel alongside the core and all converge on one job store.
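The transport-agnostic boundary can be sketched as below. These dataclass and protocol shapes are hypothetical stand-ins, not the real `RequestIX` / `ResponseIX` / `OCRClient` definitions; the point is only that the core takes a request plus injected dependencies and returns a response, with no HTTP, queue, or DB imports.

```python
from dataclasses import dataclass, field
from typing import Protocol

# Illustrative shapes only -- the real types live in the pipeline package.
@dataclass
class RequestIX:
    use_case: str
    document: bytes

@dataclass
class ResponseIX:
    fields: dict
    provenance: dict = field(default_factory=dict)

class OCRClient(Protocol):
    """Pluggable OCR engine (first implementation: Surya)."""
    def recognize(self, document: bytes) -> list[str]: ...

async def run_pipeline(req: RequestIX, ocr: OCRClient) -> ResponseIX:
    # No transport concerns here: adapters (REST, Postgres queue) call this
    # and persist the result in the shared job store themselves.
    segments = ocr.recognize(req.document)
    return ResponseIX(fields={"segment_count": len(segments)})
```

Because the core only sees injected interfaces, tests can drive it with stub OCR clients and no running server.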

Habits

  • Feature branches + PRs. New work: git checkout -b feat/<name> → commit small, logical chunks → git push forgejo feat/<name> → create PR via Forgejo API → wait for tests to pass → merge → git push server main to deploy.
  • Keep documentation up to date in the same commit as the code. README.md, docs/, and AGENTS.md update alongside the change. Unpushed / undocumented work is work that isn't done.
  • Deploy after merging. git push server main rebuilds the Docker image via post-receive and restarts the container. Smoke-test the live service before walking away.
  • Never skip hooks (--no-verify, etc.) without explicit user approval. Prefer creating new commits over amending. Never force-push main.
  • Forgejo: repo at http://192.168.68.42:3030/goldstein/infoxtractor (to be created). Use basic auth with FORGEJO_USR / FORGEJO_PSD from ~/Projects/infrastructure/.env, or an API token once issued for this repo.
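The "create PR via Forgejo API" step above could look roughly like this, using only the standard library. The endpoint path follows the Gitea-compatible API Forgejo exposes; the branch name and title are placeholders, and credentials come from the env vars named above.

```python
import base64
import json
import os
import urllib.request

API = "http://192.168.68.42:3030/api/v1"

def build_pr_request(branch: str, title: str) -> urllib.request.Request:
    """Build (but do not send) a pull-request creation call for this repo."""
    body = json.dumps({"title": title, "head": branch, "base": "main"}).encode()
    # Basic auth with FORGEJO_USR / FORGEJO_PSD from infrastructure/.env
    creds = f"{os.environ.get('FORGEJO_USR', '')}:{os.environ.get('FORGEJO_PSD', '')}"
    auth = base64.b64encode(creds.encode()).decode()
    return urllib.request.Request(
        f"{API}/repos/goldstein/infoxtractor/pulls",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Basic {auth}"},
        method="POST",
    )

# To actually open the PR:
# urllib.request.urlopen(build_pr_request("feat/ui", "feat(ui): browser UI"))
```

Separating request construction from sending keeps the call easy to dry-run in tests.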

Tech Stack (MVP)

  • Language: Python 3.12, asyncio
  • Web/REST: FastAPI + uvicorn
  • OCR (pluggable): Surya OCR first (GPU, shares RTX 3090 with Ollama / Immich ML)
  • LLM: Ollama at 192.168.68.42:11434, structured outputs via JSON schema. Initial model candidate: qwen2.5:32b / qwen3:14b, configurable per use case
  • State: Postgres on the shared postgis container (:5431), new infoxtractor database
  • Deployment: Docker, git push server main → post-receive rebuild (pattern from other apps)
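A minimal sketch of the structured-output call to Ollama: its /api/chat endpoint accepts a JSON schema in the "format" field and constrains the model's reply to it. The schema and prompt below are illustrative, not the real bank_statement_header definitions.

```python
import json
import urllib.request

OLLAMA = "http://192.168.68.42:11434"

def build_chat_payload(model: str, text: str, schema: dict) -> dict:
    """Assemble an /api/chat body with a JSON-schema-constrained output."""
    return {
        "model": model,  # e.g. "qwen2.5:32b" -- configurable per use case
        "messages": [{"role": "user", "content": f"Extract fields from:\n{text}"}],
        "format": schema,  # JSON schema -> structured output
        "stream": False,
    }

# Toy schema for illustration only:
schema = {
    "type": "object",
    "properties": {"iban": {"type": "string"}},
    "required": ["iban"],
}
payload = build_chat_payload("qwen2.5:32b", "IBAN: DE89 3704 0044 0532 0130 00", schema)

# To actually call the local server:
# req = urllib.request.Request(f"{OLLAMA}/api/chat",
#                              data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# reply = json.load(urllib.request.urlopen(req))["message"]["content"]
```

Keeping the schema per use case is what lets one deployment serve both registered and inline-defined extractions.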

Repository / Deploy

  • Git remotes:
    • forgejo: ssh://git@192.168.68.42:2222/goldstein/infoxtractor.git (source of truth / PRs)
    • server: bare repo with post-receive rebuild hook (to be set up)
  • Workflow: feat branch → git push forgejo feat/name → PR via Forgejo API → merge → git push server main to deploy
  • Monitoring label: infrastructure.web_url=http://192.168.68.42:<PORT>
  • Backup opt-in: backup.enable=true label on the container
  • mammon (../mammon) — first consumer. Uses ix as a fallback / second opinion for Paperless-imported bank statements where deterministic parsers don't match.
  • infrastructure (../infrastructure) — server topology, deployment pattern, Ollama setup, shared postgis Postgres.
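Putting the monitoring and backup conventions together, a compose fragment for this service might look as follows. The service name and image tag are hypothetical; only the two label keys and the port come from the conventions above.

```yaml
# Hypothetical docker-compose fragment -- illustrative, not the deployed file.
services:
  infoxtractor:
    image: infoxtractor:latest
    ports:
      - "8994:8994"
    labels:
      infrastructure.web_url: "http://192.168.68.42:8994"
      backup.enable: "true"
```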