infoxtractor/AGENTS.md
Dirk Riemann 5ee74f367c
All checks were successful
tests / test (push) Successful in 1m52s
tests / test (pull_request) Successful in 1m45s
chore(model): switch default IX_DEFAULT_MODEL to qwen3:14b (already on host)
The home server's Ollama doesn't have gpt-oss:20b pulled; qwen3:14b is
already there and is what mammon's chat agent uses. Switching the default
now so the first deploy passes the /healthz ollama probe without an extra
`ollama pull` step. The spec lists gpt-oss:20b as a concrete example;
qwen3:14b is equally on-prem and Ollama-structured-output-compatible.

Touched: AppConfig default, BankStatementHeader Request.default_model,
.env.example, setup_server.sh ollama-list check, AGENTS.md, deployment.md,
live tests. Unit tests that hard-coded the old model string but don't
assert the default were left alone.

Also: ASCII en-dash in e2e_smoke.py Paperless-style text (ruff RUF001).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 12:20:23 +02:00

3.9 KiB

InfoXtractor (ix)

Async, on-prem, LLM-powered structured information extraction microservice. Given a document (PDF, image) or text plus a use case (a Pydantic schema), returns a structured JSON result with per-field provenance (source page, bounding box, OCR segment).

Designed to be used by other on-prem services (e.g. mammon) as a reliable fallback / second opinion for format-specific deterministic parsers.

Status: design phase. Full reference spec at docs/spec-core-pipeline.md. MVP spec will live at docs/superpowers/specs/.

Guiding Principles

  • On-prem always. All LLM inference, OCR, and user-data processing run on the home server (192.168.68.42). No cloud APIs — OpenAI, Anthropic, Azure, AWS Bedrock/Textract, Google Document AI, Mistral, etc. are not to be used for user data or inference. LLM backend is Ollama (:11434); OCR runs locally (pluggable OCRClient interface, first engine: Surya on the RTX 3090); job state lives in local Postgres on the postgis container. The spec's references to Azure / AWS / OpenAI are examples to replace, not inherit.
  • Grounded extraction, not DB truth. ix returns best-effort extracted fields with segment citations, provenance, and cross-OCR agreement signals. ix does not claim its output is DB-grade; the calling service (e.g. mammon) owns the reliability decision (reconcile against anchors, stage for review, compare to deterministic parsers).
  • Transport-agnostic pipeline core. The pipeline (RequestIXResponseIX) knows nothing about HTTP, queues, or databases. Transport adapters (REST, Postgres queue, …) run in parallel alongside the core and all converge on one job store.

Habits

  • Feature branches + PRs. New work: git checkout -b feat/<name> → commit small, logical chunks → git push forgejo feat/<name> → create PR via Forgejo API → wait for tests to pass → merge → git push server main to deploy.
  • Keep documentation up to date in the same commit as the code. README.md, docs/, and AGENTS.md update alongside the change. Unpushed / undocumented work is work that isn't done.
  • Deploy after merging. git push server main rebuilds the Docker image via post-receive and restarts the container. Smoke-test the live service before walking away.
  • Never skip hooks (--no-verify, etc.) without explicit user approval. Prefer creating new commits over amending. Never force-push main.
  • Forgejo: repo at http://192.168.68.42:3030/goldstein/infoxtractor (to be created). Use basic auth with FORGEJO_USR / FORGEJO_PSD from ~/Projects/infrastructure/.env, or an API token once issued for this repo.

Tech Stack (MVP)

  • Language: Python 3.12, asyncio
  • Web/REST: FastAPI + uvicorn
  • OCR (pluggable): Surya OCR first (GPU, shares RTX 3090 with Ollama / Immich ML)
  • LLM: Ollama at 192.168.68.42:11434, structured outputs via JSON schema. Initial model candidate: qwen2.5:32b / qwen3:14b, configurable per use case
  • State: Postgres on the shared postgis container (:5431), new infoxtractor database
  • Deployment: Docker, git push server main → post-receive rebuild (pattern from other apps)

Repository / Deploy

  • Git remotes:
    • forgejo: ssh://git@192.168.68.42:2222/goldstein/infoxtractor.git (source of truth / PRs)
    • server: bare repo with post-receive rebuild hook (to be set up)
  • Workflow: feat branch → git push forgejo feat/name → PR via Forgejo API → merge → git push server main to deploy
  • Monitoring label: infrastructure.web_url=http://192.168.68.42:<PORT>
  • Backup opt-in: backup.enable=true label on the container
  • mammon (../mammon) — first consumer. Uses ix as a fallback / second opinion for Paperless-imported bank statements where deterministic parsers don't match.
  • infrastructure (../infrastructure) — server topology, deployment pattern, Ollama setup, shared postgis Postgres.