7 changed files with 22 additions and 179 deletions
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -4,7 +4,7 @@ Async, on-prem, LLM-powered structured information extraction microservice. Give

 Designed to be used by other on-prem services (e.g. mammon) as a reliable fallback / second opinion for format-specific deterministic parsers.

-Status: MVP deployed (2026-04-18) at `http://192.168.68.42:8994` — LAN only. Full reference spec at `docs/spec-core-pipeline.md`; MVP spec at `docs/superpowers/specs/2026-04-18-ix-mvp-design.md`; deploy runbook at `docs/deployment.md`.
+Status: design phase. Full reference spec at `docs/spec-core-pipeline.md`. MVP spec will live at `docs/superpowers/specs/`.

 ## Guiding Principles

--- a/README.md
+++ b/README.md
@@ -4,12 +4,10 @@ Async, on-prem, LLM-powered structured information extraction microservice.

 Given a document (PDF, image, text) and a named *use case*, ix returns a structured JSON result whose shape matches the use-case schema — together with per-field provenance (OCR segment IDs, bounding boxes, cross-OCR agreement flags) that let the caller decide how much to trust each extracted value.

-**Status:** MVP deployed. Live on the home LAN at `http://192.168.68.42:8994`.
+**Status:** design phase. Implementation about to start.

 - Full reference spec: [`docs/spec-core-pipeline.md`](docs/spec-core-pipeline.md) (aspirational; MVP is a strict subset)
 - **MVP design:** [`docs/superpowers/specs/2026-04-18-ix-mvp-design.md`](docs/superpowers/specs/2026-04-18-ix-mvp-design.md)
-- **Implementation plan:** [`docs/superpowers/plans/2026-04-18-ix-mvp-implementation.md`](docs/superpowers/plans/2026-04-18-ix-mvp-implementation.md)
-- **Deployment runbook:** [`docs/deployment.md`](docs/deployment.md)
 - Agent / development notes: [`AGENTS.md`](AGENTS.md)

 ## Principles
@@ -17,44 +15,3 @@ Given a document (PDF, image, text) and a named *use case*, ix returns a structu
 - **On-prem always.** LLM = Ollama, OCR = local engines (Surya first). No OpenAI / Anthropic / Azure / AWS / cloud.
 - **Grounded extraction, not DB truth.** ix returns best-effort fields + provenance; the caller decides what to trust.
 - **Transport-agnostic pipeline core.** REST + Postgres-queue adapters in parallel on one job store.
-
-## Submitting a job
-
-```bash
-curl -X POST http://192.168.68.42:8994/jobs \
-  -H "Content-Type: application/json" \
-  -d '{
-    "use_case": "bank_statement_header",
-    "ix_client_id": "mammon",
-    "request_id": "some-correlation-id",
-    "context": {
-      "files": [{
-        "url": "http://paperless.local/api/documents/42/download/",
-        "headers": {"Authorization": "Token …"}
-      }],
-      "texts": ["<Paperless Tesseract OCR content>"]
-    }
-  }'
-# → {"job_id":"…","ix_id":"…","status":"pending"}
-```
-
-Poll `GET /jobs/{job_id}` until `status` is `done` or `error`. Optionally pass `callback_url` to receive a webhook on completion (one-shot, no retry; polling stays authoritative).
-
-Full REST surface + provenance response shape documented in the MVP design spec.
-
-## Running locally
-
-```bash
-uv sync --extra dev
-uv run pytest tests/unit -v                    # hermetic unit + integration suite
-IX_TEST_OLLAMA=1 uv run pytest tests/live -v    # needs LAN access to Ollama + GPU
-```
-
-## Deploying
-
-```bash
-git push server main      # rebuilds Docker image, restarts container, /healthz deploy gate
-python scripts/e2e_smoke.py   # E2E acceptance against the live service
-```
-
-See [`docs/deployment.md`](docs/deployment.md) for full runbook + rollback.
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -10,8 +10,6 @@
 # The GPU reservation block matches immich-ml / the shape Docker Compose
 # expects for GPU allocation on this host.

-name: infoxtractor
-
 services:
  infoxtractor:
    build: .
--- a/docs/deployment.md
+++ b/docs/deployment.md
@@ -71,30 +71,13 @@ git push server main

 ## First deploy

-- **Date:** 2026-04-18
-- **Commit:** `fix/ollama-extract-json` (#36, the last of several Docker/ops follow-ups after PR #27 shipped the initial Dockerfile)
-- **`/healthz`:** all three probes (`postgres`, `ollama`, `ocr`) green. First-pass took ~7 min for the fresh container because Surya's recognition (1.34 GB) + detection (73 MB) models download from HuggingFace on first run; subsequent rebuilds reuse the named volumes declared in `docker-compose.yml` and come up in <30 s.
-- **E2E extraction:** `bank_statement_header` against `tests/fixtures/synthetic_giro.pdf` with Paperless-style texts:
-  - Pipeline completes in **35 s**.
-  - Extracted: `bank_name=DKB`, `account_iban=DE89370400440532013000`, `currency=EUR`, `opening_balance=1234.56`, `closing_balance=1450.22`, `statement_date=2026-03-31`, `statement_period_end=2026-03-31`, `statement_period_start=2026-03-01`, `account_type=null`.
-  - Provenance: 8 / 9 leaf fields have sources; 7 / 8 `provenance_verified` and `text_agreement` are True. `statement_period_start` shows up in the OCR but normalisation fails (dateutil picks a different interpretation of the cited day); to be chased in a follow-up.
+_(fill in after running — timestamps, commit sha, e2e_smoke output)_

-### Docker-ops follow-ups that landed during the first deploy
-
-All small, each merged as its own PR. In commit order after the scaffold (#27):
-
-- **#31** `fix(docker): uv via standalone installer` — Python 3.12 on Ubuntu 22.04 drops `distutils`; Ubuntu's pip needed it. Switched to the `uv` standalone installer, which has no pip dependency.
-- **#32** `fix(docker): include README.md in the uv sync COPY` — `hatchling` validates the readme file exists when resolving the editable project install.
-- **#33** `fix(compose): drop runtime: nvidia` — the deploy host's Docker daemon doesn't register a named `nvidia` runtime; `deploy.resources.devices` is sufficient and matches immich-ml.
-- **#34** `fix(deploy): network_mode: host` — `postgis` is bound to `127.0.0.1` on the host (security hardening T12). `host.docker.internal` points at the bridge gateway, not loopback, so the container couldn't reach postgis. Goldstein uses the same pattern.
-- **#35** `fix(deps): pin surya-ocr ^0.17` — earlier cu124 torch pin had forced surya to 0.14.1, which breaks our `surya.foundation` import and needs a transformers version that lacks `QuantizedCacheConfig`.
-- **#36** `fix(genai): drop Ollama format flag; extract trailing JSON` — Ollama 0.11.8 segfaults on Pydantic JSON Schemas (`$ref`, `anyOf`, `pattern`), and `format="json"` terminates reasoning models (qwen3) at `{}` because their `<think>…</think>` chain-of-thought isn't valid JSON. Omit the flag, inject the schema into the system prompt, extract the outermost `{…}` balanced block from the response.
-- **volumes** — named `ix_surya_cache` + `ix_hf_cache` mount `/root/.cache/datalab` + `/root/.cache/huggingface` so rebuilds don't re-download ~1.5 GB of model weights.
-
-Production notes:
-
-- `IX_DEFAULT_MODEL=qwen3:14b` (already pulled on the host). Spec listed `gpt-oss:20b` as a concrete example; swapped to keep the deploy on-prem without an extra `ollama pull`.
-- Torch 2.11 default cu13 wheels fall back to CPU against the host's CUDA 12.4 driver — Surya runs on CPU. Expected inference times: seconds per page. Upgrading the NVIDIA driver (or pinning a cu12-compatible torch wheel newer than 2.7) will unlock GPU with no code changes.
+- **Date:** TBD
+- **Commit:** TBD
+- **`/healthz` first-ok time:** TBD
+- **`e2e_smoke.py` status:** TBD
+- **Notes:** —

 ## E2E smoke test (`scripts/e2e_smoke.py`)

--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,8 +1,6 @@
 [project]
 name = "infoxtractor"
 version = "0.1.0"
-# Released 2026-04-18 with the first live deploy of the MVP. See
-# docs/deployment.md §"First deploy" for the commit + /healthz times.
 description = "Async on-prem LLM-powered structured information extraction microservice"
 readme = "README.md"
 requires-python = ">=3.12"
--- a/src/ix/genai/ollama_client.py
+++ b/src/ix/genai/ollama_client.py
@@ -96,9 +96,8 @@ class OllamaClient:
            ) from exc

        content = (payload.get("message") or {}).get("content") or ""
-        json_blob = _extract_json_blob(content)
        try:
-            parsed = response_schema.model_validate_json(json_blob)
+            parsed = response_schema.model_validate_json(content)
        except ValidationError as exc:
            raise IXException(
                IXErrorCode.IX_002_001,
@@ -160,39 +159,18 @@ class OllamaClient:
        request_kwargs: dict[str, Any],
        response_schema: type[BaseModel],
    ) -> dict[str, Any]:
-        """Map provider-neutral kwargs to Ollama's /api/chat body.
-
-        Schema strategy for Ollama 0.11.8: we pass ``format="json"`` (loose
-        JSON mode) and bake the Pydantic schema into a system message
-        ahead of the caller's own system prompt. Rationale:
-
-        * The full Pydantic schema as ``format=<schema>`` crashes llama.cpp's
-          structured-output implementation (SIGSEGV) on every non-trivial
-          shape — ``anyOf`` / ``$ref`` / ``pattern`` all trigger it.
-        * ``format="json"`` alone guarantees valid JSON but not the shape;
-          models routinely return ``{}`` when not told what fields to emit.
-        * Injecting the schema into the prompt is the cheapest way to
-          get both: the model sees the expected shape explicitly, Pydantic
-          validates the response at parse time (IX_002_001 on mismatch).
-
-        Non-Ollama ``GenAIClient`` impls can ignore this behaviour and use
-        native structured-output (``response_format`` on OpenAI, etc.).
-        """
+        """Map provider-neutral kwargs to Ollama's /api/chat body."""

        messages = self._translate_messages(
            list(request_kwargs.get("messages") or [])
        )
-        messages = _inject_schema_system_message(messages, response_schema)
        body: dict[str, Any] = {
            "model": request_kwargs.get("model"),
            "messages": messages,
            "stream": False,
-            # NOTE: format is deliberately omitted. `format="json"` made
-            # reasoning models (qwen3) abort after emitting `{}` because the
-            # constrained sampler terminated before the chain-of-thought
-            # finished; `format=<schema>` segfaulted Ollama 0.11.8. Letting
-            # the model stream freely and then extracting the trailing JSON
-            # blob works for both reasoning and non-reasoning models.
+            "format": _sanitise_schema_for_ollama(
+                response_schema.model_json_schema()
+            ),
        }

        options: dict[str, Any] = {}
@@ -224,71 +202,6 @@ class OllamaClient:
        return out


-def _extract_json_blob(text: str) -> str:
-    """Return the outermost balanced JSON object in ``text``.
-
-    Reasoning models (qwen3, deepseek-r1) wrap their real answer in
-    ``<think>…</think>`` blocks. Other models sometimes prefix prose or
-    fence the JSON in ```json``` code blocks. Finding the last balanced
-    ``{…}`` is the cheapest robust parse that works for all three shapes;
-    a malformed response yields the full text and Pydantic catches it
-    downstream as ``IX_002_001``.
-    """
-    start = text.find("{")
-    if start < 0:
-        return text
-    depth = 0
-    in_string = False
-    escaped = False
-    for i in range(start, len(text)):
-        ch = text[i]
-        if in_string:
-            if escaped:
-                escaped = False
-            elif ch == "\\":
-                escaped = True
-            elif ch == '"':
-                in_string = False
-            continue
-        if ch == '"':
-            in_string = True
-        elif ch == "{":
-            depth += 1
-        elif ch == "}":
-            depth -= 1
-            if depth == 0:
-                return text[start : i + 1]
-    return text[start:]
-
-
-def _inject_schema_system_message(
-    messages: list[dict[str, Any]],
-    response_schema: type[BaseModel],
-) -> list[dict[str, Any]]:
-    """Prepend a system message that pins the expected JSON shape.
-
-    Ollama's ``format="json"`` mode guarantees valid JSON but not the
-    field set or names. We emit the Pydantic schema as JSON and
-    instruct the model to match it. If the caller already provides a
-    system message, we prepend ours; otherwise ours becomes the first
-    system turn.
-    """
-    import json as _json
-
-    schema_json = _json.dumps(
-        _sanitise_schema_for_ollama(response_schema.model_json_schema()),
-        indent=2,
-    )
-    guidance = (
-        "Respond ONLY with a single JSON object matching this JSON Schema "
-        "exactly. No prose, no code fences, no explanations. All top-level "
-        "properties listed in `required` MUST be present. Use null for "
-        "fields you cannot confidently extract. The JSON Schema:\n"
-        f"{schema_json}"
-    )
-    return [{"role": "system", "content": guidance}, *messages]
-
-
 def _sanitise_schema_for_ollama(schema: Any) -> Any:
    """Strip null branches from ``anyOf`` unions.

--- a/tests/unit/test_ollama_client.py
+++ b/tests/unit/test_ollama_client.py
@@ -79,19 +79,16 @@ class TestInvokeHappyPath:
        body_json = json.loads(body)
        assert body_json["model"] == "gpt-oss:20b"
        assert body_json["stream"] is False
-        # No `format` is sent: Ollama 0.11.8 segfaults on full schemas and
-        # aborts to `{}` with `format=json` on reasoning models. Schema is
-        # injected into the system prompt instead; we extract the trailing
-        # JSON blob from the response and validate via Pydantic.
-        assert "format" not in body_json
+        # Format is the pydantic schema with Optional `anyOf [T, null]`
+        # patterns collapsed to just T — Ollama 0.11.8 segfaults on the
+        # anyOf+null shape, so we sanitise before sending.
+        fmt = body_json["format"]
+        assert fmt["properties"]["bank_name"] == {"title": "Bank Name", "type": "string"}
+        assert fmt["properties"]["account_number"]["type"] == "string"
+        assert "anyOf" not in fmt["properties"]["account_number"]
        assert body_json["options"]["temperature"] == 0.2
        assert "reasoning_effort" not in body_json
-        # A schema-guidance system message is prepended to the caller's
-        # messages so Ollama (format=json loose mode) emits the right shape.
-        msgs = body_json["messages"]
-        assert msgs[0]["role"] == "system"
-        assert "JSON Schema" in msgs[0]["content"]
-        assert msgs[1:] == [
+        assert body_json["messages"] == [
            {"role": "system", "content": "You extract."},
            {"role": "user", "content": "Doc body"},
        ]
@@ -125,10 +122,7 @@ class TestInvokeHappyPath:
        import json

        request_body = json.loads(httpx_mock.get_requests()[0].read())
-        # First message is the auto-injected schema guidance; after that
-        # the caller's user message has its text parts joined.
-        assert request_body["messages"][0]["role"] == "system"
-        assert request_body["messages"][1:] == [
+        assert request_body["messages"] == [
            {"role": "user", "content": "part-a\npart-b"}
        ]
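
Note on the restored `format` assertions in `tests/unit/test_ollama_client.py`: they expect Optional fields to reach Ollama as plain `T` rather than `anyOf [T, null]`. The body of `_sanitise_schema_for_ollama` is not part of this diff, so the following is only a minimal sketch of one way such a collapse could work. The function name `collapse_optional_anyof` and the rule that sibling keys (`title`, `default`) are kept next to the surviving branch are assumptions for illustration, not the project's implementation.

```python
from typing import Any


def collapse_optional_anyof(schema: Any) -> Any:
    """Hypothetical stand-in: rewrite ``anyOf [T, null]`` as just ``T``.

    Walks a JSON Schema (nested dicts and lists) and returns a new
    structure; the input is not mutated.
    """
    if isinstance(schema, list):
        return [collapse_optional_anyof(item) for item in schema]
    if not isinstance(schema, dict):
        return schema

    # Recurse first so nested Optional fields are collapsed as well.
    out = {key: collapse_optional_anyof(value) for key, value in schema.items()}

    branches = out.get("anyOf")
    if isinstance(branches, list):
        non_null = [
            branch
            for branch in branches
            if not (isinstance(branch, dict) and branch.get("type") == "null")
        ]
        if len(non_null) == 1 and isinstance(non_null[0], dict):
            # Exactly one real branch left: inline it and keep sibling
            # keys such as "title" and "default" (assumed behaviour).
            del out["anyOf"]
            return {**non_null[0], **out}
        if non_null:
            # Several real branches: keep the union, minus the null branch.
            out["anyOf"] = non_null
    return out


# Shape that Pydantic v2 typically emits for an `Optional[str] = None` field.
optional_field = {
    "anyOf": [{"type": "string"}, {"type": "null"}],
    "default": None,
    "title": "Account Number",
}
collapsed_field = collapse_optional_anyof(optional_field)
```

Under these assumptions `collapsed_field` carries `"type": "string"` and no `anyOf` key, which is the shape the test asserts for `account_number`.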