Dirk Riemann 842c4da90c

tests / test (push) Successful in 1m16s

Details

tests / test (pull_request) Successful in 1m12s

Details

chore: MVP deployed — readme, AGENTS.md status, deploy runbook filled in

First deploy done 2026-04-18. E2E extraction of the bank_statement_header
use case completes in 35 s against the live service, with 7 of 9 header
fields provenance-verified + text-agreement-green. closing_balance
asserts from spec §12 all pass.

Updates:
- README.md: status -> "MVP deployed"; worked example curl snippet;
  pointers to deployment runbook + spec + plan.
- AGENTS.md: status line updated with the live URL + date.
- pyproject.toml: version comment referencing the first deploy.
- docs/deployment.md: "First deploy" section filled in with times,
  field-level extraction result, plus a log of every small Docker/ops
  follow-up PR that had to land to make the first deploy healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-18 14:08:07 +02:00

4 KiB

Raw Permalink Blame History

InfoXtractor (ix)

Async, on-prem, LLM-powered structured information extraction microservice. Given a document (PDF, image) or text plus a use case (a Pydantic schema), returns a structured JSON result with per-field provenance (source page, bounding box, OCR segment).

Designed to be used by other on-prem services (e.g. mammon) as a reliable fallback / second opinion for format-specific deterministic parsers.

Status: MVP deployed (2026-04-18) at http://192.168.68.42:8994 — LAN only. Full reference spec at docs/spec-core-pipeline.md; MVP spec at docs/superpowers/specs/2026-04-18-ix-mvp-design.md; deploy runbook at docs/deployment.md.

Guiding Principles

On-prem always. All LLM inference, OCR, and user-data processing run on the home server (192.168.68.42). No cloud APIs — OpenAI, Anthropic, Azure, AWS Bedrock/Textract, Google Document AI, Mistral, etc. are not to be used for user data or inference. LLM backend is Ollama (:11434); OCR runs locally (pluggable OCRClient interface, first engine: Surya on the RTX 3090); job state lives in local Postgres on the postgis container. The spec's references to Azure / AWS / OpenAI are examples to replace, not inherit.
Grounded extraction, not DB truth. ix returns best-effort extracted fields with segment citations, provenance, and cross-OCR agreement signals. ix does not claim its output is DB-grade; the calling service (e.g. mammon) owns the reliability decision (reconcile against anchors, stage for review, compare to deterministic parsers).
Transport-agnostic pipeline core. The pipeline (RequestIX → ResponseIX) knows nothing about HTTP, queues, or databases. Transport adapters (REST, Postgres queue, …) run in parallel alongside the core and all converge on one job store.

Habits

Feature branches + PRs. New work: git checkout -b feat/<name> → commit small, logical chunks → git push forgejo feat/<name> → create PR via Forgejo API → wait for tests to pass → merge → git push server main to deploy.
Keep documentation up to date in the same commit as the code. README.md, docs/, and AGENTS.md update alongside the change. Unpushed / undocumented work is work that isn't done.
Deploy after merging. git push server main rebuilds the Docker image via post-receive and restarts the container. Smoke-test the live service before walking away.
Never skip hooks (--no-verify, etc.) without explicit user approval. Prefer creating new commits over amending. Never force-push main.
Forgejo: repo at http://192.168.68.42:3030/goldstein/infoxtractor (to be created). Use basic auth with FORGEJO_USR / FORGEJO_PSD from ~/Projects/infrastructure/.env, or an API token once issued for this repo.

Tech Stack (MVP)

Language: Python 3.12, asyncio
Web/REST: FastAPI + uvicorn
OCR (pluggable): Surya OCR first (GPU, shares RTX 3090 with Ollama / Immich ML)
LLM: Ollama at 192.168.68.42:11434, structured outputs via JSON schema. Initial model candidate: qwen2.5:32b / qwen3:14b, configurable per use case
State: Postgres on the shared postgis container (:5431), new infoxtractor database
Deployment: Docker, git push server main → post-receive rebuild (pattern from other apps)

Repository / Deploy

Git remotes:
- forgejo: ssh://git@192.168.68.42:2222/goldstein/infoxtractor.git (source of truth / PRs)
- server: bare repo with post-receive rebuild hook (to be set up)
Workflow: feat branch → git push forgejo feat/name → PR via Forgejo API → merge → git push server main to deploy
Monitoring label: infrastructure.web_url=http://192.168.68.42:<PORT>
Backup opt-in: backup.enable=true label on the container

mammon (../mammon) — first consumer. Uses ix as a fallback / second opinion for Paperless-imported bank statements where deterministic parsers don't match.
infrastructure (../infrastructure) — server topology, deployment pattern, Ollama setup, shared postgis Postgres.

4 KiB Raw Permalink Blame History