infoxtractor

Author	SHA1	Message	Date
Dirk Riemann	dcd1bc764a	feat(pipeline): Step ABC + Pipeline runner + Timer (spec §3, §4) Adds the transport-agnostic pipeline orchestrator. Each step implements async validate + process; the runner wraps both in a Timer, writes per-step entries to response.metadata.timings, and aborts on the first IXException by writing response.error. - Step exposes a step_name property (defaults to class name) so tests and logs label steps consistently. - Timer is a plain context manager that appends one {step, elapsed_seconds} entry on exit regardless of whether the body raised, so the timeline stays reconstructable for failed steps. - 9 unit tests cover ordering, skip-on-false, IXException in validate vs. process, timings populated for every executed step, and shared-response mutation across steps. Non-IX exceptions propagate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 11:06:46 +02:00
goldstein	b397a80c0b	feat(provenance): mapper + verifier (spec §9.4, §6) (#8) Provenance mapper and reliability verifier land.	2026-04-18 09:01:35 +00:00
Dirk Riemann	1e340c82fa	feat(provenance): mapper + verifier for ReliabilityStep (spec §9.4, §6) Lands the two remaining provenance-subsystem pieces: mapper.py (map_segment_refs_to_provenance): - For each LLM SegmentCitation, pick seg-ids per source_type (`value` vs `value_and_context`), cap at max_sources_per_field, resolve each via SegmentIndex, track invalid references. - Resolve field values by dot-path (`result.items[0].name` supported; `[N]` bracket notation is normalised to `.N` before traversal). - Skip fields that resolve to zero valid sources (spec §9.4). - Write quality_metrics with fields_with_provenance / total_fields / coverage_rate / invalid_references. verify.py (verify_field + apply_reliability_flags): - Dispatches per Pydantic field type: date → parse-both-sides compare; int/float/Decimal → normalize + whole-snippet / numeric-token scan; IBAN (detected via `iban` in field name) → upper+strip compare; Literal / None → flags stay None; else string substring. - _unwrap_optional handles BOTH typing.Union AND types.UnionType so `Decimal | None` (PEP 604, what get_type_hints emits on 3.12+) resolves correctly; caught by the integration-style test_writes_flags_and_counters. - Number comparator scans numeric tokens in the snippet so labels ("Closing balance CHF 1'234.56") don't mask the match. - apply_reliability_flags mutates the passed ProvenanceData in place and writes verified_fields / text_agreement_fields to quality_metrics. Tests cover each comparator, Literal/None skip, short-value skip (strings and numerics), Decimal via optional union, and end-to-end flag+counter writing against a Pydantic use-case schema that mirrors bank_statement_header. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 11:01:19 +02:00
goldstein	2d22115893	feat(provenance): normalisers + short-value skip rule (spec §6) (#7) Normalizer primitives land.	2026-04-18 08:56:45 +00:00
Dirk Riemann	527fc620fe	feat(provenance): normalisers + short-value skip rule (spec §6) Pure functions the ReliabilityStep will compose to compare extracted values against OCR snippets (and context.texts). Kept in one module so every rule is directly unit-testable without pulling in the step ABC. Highlights: - `normalize_string`: NFKC + casefold + strip common punctuation (. , : ; ! ? () [] {} / \\ ' " `) + collapse whitespace. Substring-compatible. - `normalize_number`: returns the canonical "[-]DDD.DD" form (always 2dp) after stripping currency symbols. Heuristic separator detection handles Swiss-German apostrophes ("1'234.56"), de-DE commas ("1.234,56"), and plain ASCII ("1234.56" / "1234.5" / "1234"). Accepts native int/float/Decimal as well as str. - `normalize_date`: dateutil parse with dayfirst=True → ISO YYYY-MM-DD. Date and datetime objects short-circuit to their isoformat(). - `normalize_iban`: uppercase + strip whitespace. Format validation is the call site's job; this is pure canonicalisation. - `should_skip_text_agreement`: dispatches on type + value. Literal → skip, None → skip, numeric |v|<10 → skip, len(str) ≤ 2 → skip. Numeric check runs first so `10` (len("10")==2) is treated on the numeric side (not skipped) instead of tripping the string length rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:56:31 +02:00
goldstein	b2ff27c1ca	feat(segmentation): SegmentIndex + prompt-text formatter (spec §9.1) (#6) SegmentIndex lands.	2026-04-18 08:54:02 +00:00
Dirk Riemann	1321d57354	feat(segmentation): SegmentIndex + prompt-text formatter (spec §9.1) Builds the ID <-> on-page-anchor map used by both the GenAIStep (to emit the segment-tagged user message) and the provenance mapper (to resolve LLM-cited IDs back to bbox/text/file_index). Design notes: - `build()` is a classmethod so the pipeline constructs the index in one place (OCRStep) and passes the constructed instance along in the internal context. No mutable global state; tests build indexes inline from fake OCR fixtures. - Per-page metadata (file_index) arrives via a parallel `list[PageMetadata]` rather than being smuggled into OCRResult. Keeps segmentation decoupled from ingestion: the OCR engine legitimately doesn't know which file a page came from. - Page-tag lines (`<page …>` / `</page>`) are filtered via a regex so the LLM can never cite them as provenance. `line_idx_in_page` increments only for real lines so the IDs stay dense (p1_l0, p1_l1, ...). - Bounding-box normalisation divides x-coords by page width, y-coords by page height. Zero dimensions (defensive) pass through unchanged. - `to_prompt_text(context_texts=[...])` appends paperless-style texts untagged, separated from the tagged body by a blank line (spec §7.2b). Deterministic for prompt caching. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:53:46 +02:00
goldstein	810979e416	feat(use_cases): registry + bank_statement_header (spec §7) (#5) First use case lands.	2026-04-18 08:51:58 +00:00
Dirk Riemann	b80c7952f7	feat(use_cases): registry + bank_statement_header (spec §7) First use case lands. The schema is intentionally flat (nine scalar fields, no nested arrays) because Ollama's structured-output guidance stays most reliable when the top level has only scalars, and every field we care about (bank_name, IBAN, period, opening/closing balance) can be rendered as one. Registration is explicit in `use_cases/__init__.py`, not a side effect of importing the use-case module. That keeps load order obvious and lets tests patch the registry without having to reload modules. `get_use_case(name)` is the one-liner adapters use; it raises `IX_001_001` with the offending name in `detail` when the lookup misses, which keeps log-scrape simple. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:51:43 +02:00
goldstein	230068e484	feat(contracts): ResponseIX + Provenance + Job (spec §3, §9.3) (#4) Lands the outgoing-response data contracts.	2026-04-18 08:50:37 +00:00
Dirk Riemann	02db3b05cc	feat(contracts): ResponseIX + Provenance + Job envelope (spec §3, §9.3) Completes the data-contract layer. Highlights: - `ResponseIX.context` is an internal mutable accumulator used by pipeline steps (pages, files, texts, use_case classes, segment index). It MUST NOT leak into the serialised response, so we mark the field with `Field(exclude=True)` and carry the shape in a small `_InternalContext` sub-model with `extra="allow"` so steps can stash arbitrary state without schema churn. Tested: `model_dump()` and `model_dump_json()` both drop it. - `FieldProvenance` gains `provenance_verified: bool | None` and `text_agreement: bool | None`, the two MVP reliability flags written by the new ReliabilityStep. Both default None so rows predating the ReliabilityStep (empty LLM output, cloud-import replay) parse cleanly. - `quality_metrics` stays a free-form `dict[str, Any]`: the MVP adds `verified_fields` and `text_agreement_fields` counters without carving them into the schema, which keeps future metric additions free. - `Job.status` and `Job.callback_status` are `Literal[...]` so Pydantic rejects unknown states at the edge. Invariant (`status='done' iff response.error is None`) stays worker-enforced: callers sometimes hydrate in-flight rows and we do not want validation to reject them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:50:22 +02:00
goldstein	5990218172	feat(contracts): RequestIX + Context + Options (spec §3) (#3) Lands the incoming-request Pydantic v2 contracts.	2026-04-18 08:47:47 +00:00
Dirk Riemann	181cc0fbea	feat(contracts): RequestIX + Context + Options per spec §3 Adds the incoming-request data contracts as Pydantic v2 models. Matches the MVP spec §3 exactly: fields dropped from the reference spec (use_vision, reasoning_effort, version, ...) stay out, and `extra="forbid"` catches any caller that sends them so drift surfaces immediately instead of silently. Context.files is `list[str | FileRef]`: plain URLs stay str, dict entries parse as FileRef. This keeps the common case (public URL) one-liner while still supporting Paperless-style auth headers and per-file size caps. ix_id stays optional with a docstring warning that callers MUST NOT set it: the transport layer assigns the 16-char hex handle on insert. The field is present so `Job` round-trips out of the store. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:47:31 +02:00
goldstein	ebdba99d9f	feat(errors): IXException + IXErrorCode (spec §8) (#2) Lands the single exception type and ten IX_* codes used throughout the pipeline.	2026-04-18 08:46:19 +00:00
Dirk Riemann	ae595c937a	feat(errors): add IXException + IXErrorCode per spec §8 Adds the single exception type used throughout the pipeline. Every failure maps to one of the ten IX_* codes from the MVP spec §8 with a stable machine-readable code and an optional free-form detail. The `str()` form is log-scrapable with a single regex (`IX_xxx_xxx: <msg> (detail=...)`), so mammon-side reliability UX can classify failures without brittle string parsing. Enum values equal names so callers can serialise either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:46:01 +02:00
goldstein	663cb4ae10	feat(scaffold): project skeleton with uv + pytest + forgejo CI (#1) Lands Task 1.1 from the MVP plan: empty-project skeleton so later tasks have somewhere to land. Local tests + ruff pass. CI trigger fix included so feat branches get runs going forward.	2026-04-18 08:42:56 +00:00
Dirk Riemann	4120d106aa	ci: trigger re-run	2026-04-18 10:41:57 +02:00
Dirk Riemann	097ebf5db7	ci: run on every push (not just main) so feat branches also get CI Matches mammon's pattern more closely and makes PR CI reliable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:40:44 +02:00
Dirk Riemann	7e141829ac	fix(ci): create empty tests/integration so pytest doesn't error on missing dir Integration tests land in Chunk 3; until then CI needs the directory to exist. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:39:26 +02:00
Dirk Riemann	a71f023ed9	fix(ci): match mammon's Forgejo Actions pattern (no explicit container image) The previous python:3.12-slim container lacked node, which actions/checkout@v4 requires. The Forgejo runner's default image includes node + apt + curl, so we can bootstrap python + uv the same way mammon does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:37:56 +02:00
Dirk Riemann	57cdfd73fb	feat(scaffold): project skeleton with uv + pytest + forgejo CI - pyproject.toml: runtime deps (FastAPI, SQLAlchemy async, Pydantic, PyMuPDF, python-magic, Pillow, dateutil), dev group (pytest, pytest-asyncio, pytest-httpx, ruff, mypy), optional `ocr` extra that pulls surya-ocr + torch (kept optional so CI without GPU can run the base package). - pytest config: asyncio_mode=auto; `live` marker for tests that need a real Ollama/Surya (gated on IX_TEST_OLLAMA=1). - Single smoke test (tests/unit/test_scaffolding.py) verifies the package imports and exposes __version__, which keeps CI green until the real test modules land in later chunks. - .forgejo/workflows/ci.yml: runs ruff + pytest against a Postgres 16 service container. Explicit IX_TEST_MODE=fake keeps real-client tests out. - .env.example: every IX_* var from spec §9 with on-prem-friendly defaults. - uv.lock committed for reproducible builds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:36:43 +02:00
Dirk Riemann	86538ee8de	Implementation plan for ix MVP Detailed, TDD-structured plan with 5 chunks covering ~30 feature-branch tasks from foundation scaffolding through first live deploy + E2E smoke. Each task is one PR; pipeline core comes hermetic-first, real Surya/Ollama clients in Chunk 4, containerization + first deploy in Chunk 5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:34:30 +02:00
Dirk Riemann	5e007b138d	Address spec review: auth, timeouts, lifecycle, error codes - FileRef type added so callers (mammon/Paperless) can pass Authorization headers alongside URLs. context.files is now list[str | FileRef]. - Job lifecycle state machine pinned down, including worker-startup sweep for rows stuck in 'running' after a crash. - Explicit IX_002_000 / IX_002_001 codes for Ollama unreachable and structured-output schema violations, with per-call timeout IX_GENAI_CALL_TIMEOUT_SECONDS distinct from the per-job timeout. - IX_000_007 code for file-fetch failures; per-file size, connect, and read timeouts configurable via env. - ReliabilityStep: Literal-typed fields and None values explicitly skipped from provenance verification (with reason); dates parse both sides before ISO comparison. - /healthz semantics pinned down (CUDA + Surya loaded; Ollama reachable AND model available). /metrics window is last 24h. - (client_id, request_id) is UNIQUE in ix_jobs, matching the idempotency claim. - Deploy-failure workflow uses `git revert` forward commit, not force-push, aligned with AGENTS.md habits. - Dockerfile / compose require --gpus all. Pre-deploy requires `ollama pull gpt-oss:20b`; /healthz verifies before deploy completes. - CI clarified: Forgejo Actions runners are GPU-less and LAN-disconnected; all inference is stubbed there. Real-Ollama tests behind IX_TEST_OLLAMA=1. - Fixture redaction stance: synthetic-template PDF committed; real redacted fixtures live out-of-repo. - Deferred list picks up use_case URL/Base64, callback retries, multi-container workers. quality_metrics retains reference-spec counters plus the two new MVP ones. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:28:43 +02:00
Dirk Riemann	124403252d	Initial design: on-prem LLM extraction microservice MVP Establishes ix as an async, on-prem, LLM-powered structured extraction microservice. Full reference spec stays in docs/spec-core-pipeline.md; MVP spec (strict subset — Ollama only, Surya OCR, REST + Postgres-queue transports in parallel, in-repo use cases, provenance-based reliability signals) lives at docs/superpowers/specs/2026-04-18-ix-mvp-design.md. First use case: bank_statement_header (feeds mammon's needs_parser flow). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 10:23:17 +02:00
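
The runner behaviour described in commit dcd1bc764a (validate + process wrapped in a Timer, one timing entry per executed step, abort on the first IXException) can be sketched roughly as follows. This is a minimal illustration and not the repository's code: the response is a plain dict instead of ResponseIX, the `IXException` class here is a bare stand-in, and `run_pipeline` plus the three demo steps are hypothetical names.

```python
import asyncio
import time
from abc import ABC, abstractmethod


class IXException(Exception):
    """Stand-in for the project's exception type (real shape not shown here)."""


class Timer:
    """Appends one {step, elapsed_seconds} entry on exit, even if the body raised."""

    def __init__(self, step: str, timings: list) -> None:
        self.step = step
        self.timings = timings

    def __enter__(self) -> "Timer":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        elapsed = time.perf_counter() - self._start
        self.timings.append({"step": self.step, "elapsed_seconds": elapsed})
        return False  # never swallow the exception


class Step(ABC):
    @property
    def step_name(self) -> str:
        # Defaults to the class name so logs and tests label steps consistently.
        return type(self).__name__

    @abstractmethod
    async def validate(self, response: dict) -> bool: ...

    @abstractmethod
    async def process(self, response: dict) -> None: ...


async def run_pipeline(steps: list, response: dict) -> dict:
    # Assumption: timings live under a flat "timings" key in this sketch.
    timings = response.setdefault("timings", [])
    for step in steps:
        try:
            with Timer(step.step_name, timings):
                if await step.validate(response):  # False means skip process
                    await step.process(response)
        except IXException as exc:
            response["error"] = str(exc)  # abort on the first IXException
            break
    return response


class Ok(Step):
    async def validate(self, response): return True
    async def process(self, response): response["seen"] = True


class Boom(Step):
    async def validate(self, response): return True
    async def process(self, response): raise IXException("demo failure")


class Never(Step):
    async def validate(self, response): return True
    async def process(self, response): response["never"] = True


result = asyncio.run(run_pipeline([Ok(), Boom(), Never()], {}))
```

In this sketch the failing step still gets a timing entry because `__exit__` runs before the exception reaches the `except` clause, which is the property the commit message calls out. Exceptions other than `IXException` are not caught and propagate to the caller.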

24 commits