infoxtractor/src/ix
Dirk Riemann 290e51416f
feat(ingestion): fetch_file + MIME sniff + DocumentIngestor (spec §6.1)
Adds three layered modules that the SetupStep will wire together in Task 2.4.

- fetch.py: async httpx fetch with configurable timeouts + incremental
  size cap (streamed response, accumulate bytes, raise IX_000_007 when
  exceeded). file:// URLs are read locally. Auth headers pass through. The
  caller injects a FetchConfig; env reads happen in ix.config (Chunk 3).
- mime.py: python-magic byte-sniff + SUPPORTED_MIMES frozenset +
  require_supported(mime) helper that raises IX_000_005.
- pages.py: DocumentIngestor.build_pages(files, texts) ->
  (list[Page], list[PageMetadata]). PDFs via PyMuPDF (hard cap of 100
  pages per PDF -> IX_000_006), images via Pillow (multi-frame TIFFs yield
  multiple Pages), texts as zero-dimension Pages so GenAIStep can still
  cite them.
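
The incremental size cap described for fetch.py can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's code: `IXSizeCapError`, `read_capped`, `fake_body`, and `max_bytes` are hypothetical names, and the chunk source is a plain async iterator standing in for a streamed httpx response body (for example, what `response.aiter_bytes()` yields).

```python
import asyncio
from collections.abc import AsyncIterator


class IXSizeCapError(Exception):
    """Hypothetical stand-in for the exception carrying code IX_000_007."""


async def read_capped(chunks: AsyncIterator[bytes], max_bytes: int) -> bytes:
    """Accumulate chunks and abort as soon as the total would exceed max_bytes."""
    buf = bytearray()
    async for chunk in chunks:
        # Check before appending so we never hold more than the cap in memory.
        if len(buf) + len(chunk) > max_bytes:
            raise IXSizeCapError(f"IX_000_007: body exceeds {max_bytes} bytes")
        buf.extend(chunk)
    return bytes(buf)


async def fake_body(parts: list[bytes]) -> AsyncIterator[bytes]:
    """Assumed test double for a streamed response body."""
    for part in parts:
        yield part


if __name__ == "__main__":
    body = asyncio.run(read_capped(fake_body([b"ab", b"cd"]), max_bytes=4))
    print(body)
```

Checking the running total per chunk, rather than after the full download, means an oversized response is rejected without buffering it whole.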

21 new unit tests (141 total) cover: fetch success with headers, 4xx/5xx
mapping, timeout -> IX_000_007, size cap enforced globally + per-file,
file:// happy path + missing file, MIME detection for PDF/PNG/JPEG/TIFF,
require_supported gate, PDF/TIFF/text page counts, 101-page PDF ->
IX_000_006, multi-file file_index assignment.
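
The MIME gate exercised by these tests might look roughly like the sketch below. It is an illustration under assumptions: the real module sniffs with python-magic, whereas this version swaps in a small hand-written signature table so that it runs with the standard library alone. Only the names `SUPPORTED_MIMES` and `require_supported` and the code IX_000_005 come from the commit message; the set's contents, `sniff_mime`, and `UnsupportedMimeError` are hypothetical.

```python
class UnsupportedMimeError(Exception):
    """Hypothetical stand-in for the exception carrying code IX_000_005."""


# Assumed contents; the commit message only says this is a frozenset.
SUPPORTED_MIMES = frozenset(
    {"application/pdf", "image/png", "image/jpeg", "image/tiff"}
)

# Leading magic bytes for the four formats named in the tests.
_SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
)


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from leading bytes (stand-in for python-magic)."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    return "application/octet-stream"


def require_supported(mime: str) -> None:
    """Raise if the MIME type is not in the supported set."""
    if mime not in SUPPORTED_MIMES:
        raise UnsupportedMimeError(f"IX_000_005: unsupported MIME type {mime!r}")


if __name__ == "__main__":
    mime = sniff_mime(b"%PDF-1.7\n")
    require_supported(mime)
    print(mime)
```

Keeping the sniff and the gate as separate functions lets a caller log the detected type before rejecting it.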

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 11:12:00 +02:00
contracts     feat(contracts): ResponseIX + Provenance + Job envelope (spec §3, §9.3)  2026-04-18 10:50:22 +02:00
genai         feat(clients): OCRClient + GenAIClient protocols + fakes (spec §6.2, §6.3)  2026-04-18 11:08:24 +02:00
ingestion     feat(ingestion): fetch_file + MIME sniff + DocumentIngestor (spec §6.1)  2026-04-18 11:12:00 +02:00
ocr           feat(clients): OCRClient + GenAIClient protocols + fakes (spec §6.2, §6.3)  2026-04-18 11:08:24 +02:00
pipeline      feat(pipeline): Step ABC + Pipeline runner + Timer (spec §3, §4)  2026-04-18 11:06:46 +02:00
provenance    feat(provenance): mapper + verifier for ReliabilityStep (spec §9.4, §6)  2026-04-18 11:01:19 +02:00
segmentation  feat(segmentation): SegmentIndex + prompt-text formatter (spec §9.1)  2026-04-18 10:53:46 +02:00
use_cases     feat(use_cases): registry + bank_statement_header (spec §7)  2026-04-18 10:51:43 +02:00
__init__.py   feat(scaffold): project skeleton with uv + pytest + forgejo CI  2026-04-18 10:36:43 +02:00
errors.py     feat(errors): add IXException + IXErrorCode per spec §8  2026-04-18 10:46:01 +02:00