infoxtractor/AGENTS.md
Dirk Riemann 842c4da90c
All checks were successful
tests / test (push) Successful in 1m16s
tests / test (pull_request) Successful in 1m12s
chore: MVP deployed — readme, AGENTS.md status, deploy runbook filled in
First deploy done 2026-04-18. E2E extraction of the bank_statement_header
use case completes in 35 s against the live service, with 7 of 9 header
fields provenance-verified + text-agreement-green. closing_balance
asserts from spec §12 all pass.

Updates:
- README.md: status -> "MVP deployed"; worked example curl snippet;
  pointers to deployment runbook + spec + plan.
- AGENTS.md: status line updated with the live URL + date.
- pyproject.toml: version comment referencing the first deploy.
- docs/deployment.md: "First deploy" section filled in with times,
  field-level extraction result, plus a log of every small Docker/ops
  follow-up PR that had to land to make the first deploy healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:08:07 +02:00

4 KiB

InfoXtractor (ix)

Async, on-prem, LLM-powered structured information extraction microservice. Given a document (PDF, image) or text plus a use case (a Pydantic schema), returns a structured JSON result with per-field provenance (source page, bounding box, OCR segment).

Designed to be used by other on-prem services (e.g. mammon) as a reliable fallback / second opinion for format-specific deterministic parsers.

Status: MVP deployed (2026-04-18) at http://192.168.68.42:8994 — LAN only. Full reference spec at docs/spec-core-pipeline.md; MVP spec at docs/superpowers/specs/2026-04-18-ix-mvp-design.md; deploy runbook at docs/deployment.md.

Guiding Principles

  • On-prem always. All LLM inference, OCR, and user-data processing run on the home server (192.168.68.42). No cloud APIs — OpenAI, Anthropic, Azure, AWS Bedrock/Textract, Google Document AI, Mistral, etc. are not to be used for user data or inference. LLM backend is Ollama (:11434); OCR runs locally (pluggable OCRClient interface, first engine: Surya on the RTX 3090); job state lives in local Postgres on the postgis container. The spec's references to Azure / AWS / OpenAI are examples to replace, not inherit.
  • Grounded extraction, not DB truth. ix returns best-effort extracted fields with segment citations, provenance, and cross-OCR agreement signals. ix does not claim its output is DB-grade; the calling service (e.g. mammon) owns the reliability decision (reconcile against anchors, stage for review, compare to deterministic parsers).
  • Transport-agnostic pipeline core. The pipeline (RequestIXResponseIX) knows nothing about HTTP, queues, or databases. Transport adapters (REST, Postgres queue, …) run in parallel alongside the core and all converge on one job store.

Habits

  • Feature branches + PRs. New work: git checkout -b feat/<name> → commit small, logical chunks → git push forgejo feat/<name> → create PR via Forgejo API → wait for tests to pass → merge → git push server main to deploy.
  • Keep documentation up to date in the same commit as the code. README.md, docs/, and AGENTS.md update alongside the change. Unpushed / undocumented work is work that isn't done.
  • Deploy after merging. git push server main rebuilds the Docker image via post-receive and restarts the container. Smoke-test the live service before walking away.
  • Never skip hooks (--no-verify, etc.) without explicit user approval. Prefer creating new commits over amending. Never force-push main.
  • Forgejo: repo at http://192.168.68.42:3030/goldstein/infoxtractor (to be created). Use basic auth with FORGEJO_USR / FORGEJO_PSD from ~/Projects/infrastructure/.env, or an API token once issued for this repo.

Tech Stack (MVP)

  • Language: Python 3.12, asyncio
  • Web/REST: FastAPI + uvicorn
  • OCR (pluggable): Surya OCR first (GPU, shares RTX 3090 with Ollama / Immich ML)
  • LLM: Ollama at 192.168.68.42:11434, structured outputs via JSON schema. Initial model candidate: qwen2.5:32b / qwen3:14b, configurable per use case
  • State: Postgres on the shared postgis container (:5431), new infoxtractor database
  • Deployment: Docker, git push server main → post-receive rebuild (pattern from other apps)

Repository / Deploy

  • Git remotes:
    • forgejo: ssh://git@192.168.68.42:2222/goldstein/infoxtractor.git (source of truth / PRs)
    • server: bare repo with post-receive rebuild hook (to be set up)
  • Workflow: feat branch → git push forgejo feat/name → PR via Forgejo API → merge → git push server main to deploy
  • Monitoring label: infrastructure.web_url=http://192.168.68.42:<PORT>
  • Backup opt-in: backup.enable=true label on the container
  • mammon (../mammon) — first consumer. Uses ix as a fallback / second opinion for Paperless-imported bank statements where deterministic parsers don't match.
  • infrastructure (../infrastructure) — server topology, deployment pattern, Ollama setup, shared postgis Postgres.