Async on-prem LLM-powered structured information extraction microservice


Dirk Riemann 1e340c82fa		2026-04-18 11:01:19 +02:00

feat(provenance): mapper + verifier for ReliabilityStep (spec §9.4, §6)

Lands the two remaining provenance-subsystem pieces:

mapper.py (map_segment_refs_to_provenance):
- For each LLM SegmentCitation, pick seg-ids per source_type (`value` vs `value_and_context`), cap at max_sources_per_field, resolve each via SegmentIndex, track invalid references.
- Resolve field values by dot-path (`result.items[0].name` supported; `[N]` bracket notation is normalised to `.N` before traversal).
- Skip fields that resolve to zero valid sources (spec §9.4).
- Write quality_metrics with fields_with_provenance / total_fields / coverage_rate / invalid_references.

verify.py (verify_field + apply_reliability_flags):
- Dispatches per Pydantic field type: date → parse-both-sides compare; int/float/Decimal → normalize + whole-snippet / numeric-token scan; IBAN (detected via `iban` in field name) → upper+strip compare; Literal / None → flags stay None; else string substring.
- _unwrap_optional handles BOTH typing.Union AND types.UnionType so `Decimal | None` (PEP 604, what get_type_hints emits on 3.12+) resolves correctly (caught by the integration-style test_writes_flags_and_counters).
- Number comparator scans numeric tokens in the snippet so labels ("Closing balance CHF 1'234.56") don't mask the match.
- apply_reliability_flags mutates the passed ProvenanceData in place and writes verified_fields / text_agreement_fields to quality_metrics.

Tests cover each comparator, Literal/None skip, short-value skip (strings and numerics), Decimal via optional union, and end-to-end flag+counter writing against a Pydantic use-case schema that mirrors bank_statement_header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
.forgejo/workflows	ci: run on every push (not just main) so feat branches also get CI	2026-04-18 10:40:44 +02:00
docs	Implementation plan for ix MVP	2026-04-18 10:34:30 +02:00
src/ix	feat(provenance): mapper + verifier for ReliabilityStep (spec §9.4, §6)	2026-04-18 11:01:19 +02:00
tests	feat(provenance): mapper + verifier for ReliabilityStep (spec §9.4, §6)	2026-04-18 11:01:19 +02:00
.env.example	feat(scaffold): project skeleton with uv + pytest + forgejo CI	2026-04-18 10:36:43 +02:00
.gitignore	feat(scaffold): project skeleton with uv + pytest + forgejo CI	2026-04-18 10:36:43 +02:00
.python-version	feat(scaffold): project skeleton with uv + pytest + forgejo CI	2026-04-18 10:36:43 +02:00
AGENTS.md	Initial design: on-prem LLM extraction microservice MVP	2026-04-18 10:23:17 +02:00
pyproject.toml	feat(scaffold): project skeleton with uv + pytest + forgejo CI	2026-04-18 10:36:43 +02:00
README.md	Initial design: on-prem LLM extraction microservice MVP	2026-04-18 10:23:17 +02:00
uv.lock	feat(scaffold): project skeleton with uv + pytest + forgejo CI	2026-04-18 10:36:43 +02:00

README.md

InfoXtractor (ix)

Async, on-prem, LLM-powered structured information extraction microservice.

Given a document (PDF, image, text) and a named use case, ix returns a structured JSON result whose shape matches the use-case schema, together with per-field provenance (OCR segment IDs, bounding boxes, cross-OCR agreement flags) that lets the caller decide how much to trust each extracted value.
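To illustrate how a caller might act on per-field provenance, here is a minimal sketch of a client-side trust filter. Everything in it is an assumption made for illustration: the `FieldProvenance` class, its attribute names, and the policy of accepting fields whose checks were not applicable are hypothetical and are not taken from the ix response schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldProvenance:
    # Hypothetical attribute names, chosen for this sketch only.
    segment_ids: list[str]               # OCR segments cited for the field
    provenance_verified: Optional[bool]  # None = check not applicable
    text_agreement: Optional[bool]       # None = nothing to compare against


def trusted_fields(provenance: dict[str, FieldProvenance]) -> set[str]:
    """Keep fields that cite at least one segment and failed no check.

    Assumed policy: a flag of None counts as "no objection"; a stricter
    caller could require both flags to be True instead.
    """
    return {
        path
        for path, prov in provenance.items()
        if prov.segment_ids
        and prov.provenance_verified is not False
        and prov.text_agreement is not False
    }


example = {
    "iban": FieldProvenance(["s1"], True, True),
    "total": FieldProvenance(["s2"], False, None),
    "currency": FieldProvenance(["s3"], None, None),
    "name": FieldProvenance([], None, None),
}
accepted = trusted_fields(example)
```

In this example `iban` and `currency` are accepted, `total` is rejected because one check failed, and `name` is rejected because it cites no segment.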

Status: design phase. Implementation about to start.

Full reference spec: docs/spec-core-pipeline.md (aspirational; MVP is a strict subset)
MVP design: docs/superpowers/specs/2026-04-18-ix-mvp-design.md
Agent / development notes: AGENTS.md

Principles

On-prem always. LLM = Ollama, OCR = local engines (Surya first). No OpenAI / Anthropic / Azure / AWS / cloud.
Grounded extraction, not DB truth. ix returns best-effort fields + provenance; the caller decides what to trust.
Transport-agnostic pipeline core. REST + Postgres-queue adapters in parallel on one job store.