feat(ingestion): fetch_file + MIME sniff + DocumentIngestor #11

Merged
goldstein merged 1 commit from feat/ingestion into main 2026-04-18 09:12:19 +00:00

Chunk 2, Task 2.3.

Adds `ix.ingestion` with three layered modules (fetch, mime, pages) per spec §6.1. Wired into SetupStep in Task 2.4.

Tests

21 new unit tests (141 total):

  • fetch: success + Authorization header, 4xx/5xx -> IX_000_007, timeout -> IX_000_007, oversize (global + per-file) -> IX_000_007, file:// happy + missing
  • mime: PDF / PNG / JPEG / TIFF sniffed, supported-set membership, require_supported gate
  • pages: 3-page PDF -> 3 Pages, 101-page PDF -> IX_000_006, multi-frame TIFF -> multiple Pages, texts -> zero-dim Pages, multi-file file_index
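
To make the mime cases above concrete, here is a minimal sketch of the sniff-then-gate flow. It is an illustration, not the code in this PR: the real module sniffs with python-magic, whereas this stand-in matches well-known file signatures by hand so it runs with the standard library only. The contents of `SUPPORTED_MIMES`, the `UnsupportedMimeError` class, and the `sniff_mime` name are assumptions; only `require_supported`, the frozenset, and the IX_000_005 code come from the description.

```python
# Assumed contents: the four types the tests sniff. The real set may differ.
SUPPORTED_MIMES = frozenset(
    {"application/pdf", "image/png", "image/jpeg", "image/tiff"}
)

# Leading-byte signatures, used here as a stand-in for python-magic.
_SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
)


class UnsupportedMimeError(Exception):
    """Hypothetical error type; carries the IX_000_005 code."""

    code = "IX_000_005"


def sniff_mime(data: bytes) -> str:
    """Return a MIME type based on the leading bytes, ignoring any file name."""
    for prefix, mime in _SIGNATURES:
        if data.startswith(prefix):
            return mime
    return "application/octet-stream"


def require_supported(mime: str) -> None:
    """Raise if the sniffed type is outside the supported set."""
    if mime not in SUPPORTED_MIMES:
        raise UnsupportedMimeError(f"{UnsupportedMimeError.code}: {mime}")
```

Sniffing the bytes rather than trusting a file extension or a Content-Type header means a mislabelled upload is rejected at the gate instead of failing later in page rendering.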

`uv run pytest tests/unit -q` -> 141 passed.
`uv run ruff check src tests` -> clean.

Merge gate

Forgejo Actions trigger bug is still in effect — local test + ruff are the gate.

goldstein added 1 commit 2026-04-18 09:12:14 +00:00
feat(ingestion): fetch_file + MIME sniff + DocumentIngestor (spec §6.1)
All checks were successful
tests / test (push) Successful in 57s
tests / test (pull_request) Successful in 1m12s
290e51416f
Three layered modules the SetupStep will wire together in Task 2.4.

- fetch.py: async httpx fetch with configurable timeouts + incremental
  size cap (stream=True, accumulate bytes, raise IX_000_007 when
  exceeded). file:// URLs read locally. Auth headers pass through. The
  caller injects a FetchConfig — env reads happen in ix.config (Chunk 3).
- mime.py: python-magic byte-sniff + SUPPORTED_MIMES frozenset +
  require_supported(mime) helper that raises IX_000_005.
- pages.py: DocumentIngestor.build_pages(files, texts) ->
  (list[Page], list[PageMetadata]). PDFs via PyMuPDF (hard 100 pg/PDF
  cap -> IX_000_006), images via Pillow (multi-frame TIFFs yield
  multiple Pages), texts as zero-dim Pages so GenAIStep can still cite
  them.
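
The incremental size cap described for fetch.py can be sketched roughly as below. This is a simplified, synchronous illustration under stated assumptions, not the module itself: `read_capped` and `FetchError` are hypothetical names, and the chunks are a plain iterable standing in for the streamed response body that the real async code reads from httpx.

```python
from collections.abc import Iterable


class FetchError(Exception):
    """Hypothetical error type; carries the IX_000_007 code."""

    code = "IX_000_007"


def read_capped(chunks: Iterable[bytes], max_bytes: int) -> bytes:
    """Accumulate chunks, stopping as soon as the total would exceed max_bytes.

    The check runs before each chunk is buffered, so an oversized body is
    rejected without being read to the end or held in memory in full.
    """
    buf = bytearray()
    for chunk in chunks:
        if len(buf) + len(chunk) > max_bytes:
            raise FetchError(f"{FetchError.code}: body exceeds {max_bytes} bytes")
        buf.extend(chunk)
    return bytes(buf)
```

Checking while streaming, rather than after the download completes, keeps the cap effective even when the server sends no Content-Length header or sends a wrong one.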

21 new unit tests (141 total) cover: fetch success with headers, 4xx/5xx
mapping, timeout -> IX_000_007, size cap enforced globally + per-file,
file:// happy path + missing file, MIME detection for PDF/PNG/JPEG/TIFF,
require_supported gate, PDF/TIFF/text page counts, 101-page PDF ->
IX_000_006, multi-file file_index assignment.
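
As a rough illustration of the page-count and file_index cases listed above, the sketch below plans page references from per-PDF page counts. It is not the DocumentIngestor code: `PageRef`, `PageLimitError`, and `plan_pdf_pages` are hypothetical names, and the 0-based `file_index` and 1-based `page_no` conventions are assumptions. Only the 100-page cap per PDF and the IX_000_006 code come from this commit message.

```python
from dataclasses import dataclass

MAX_PAGES_PER_PDF = 100  # hard cap stated in the commit message


class PageLimitError(Exception):
    """Hypothetical error type; carries the IX_000_006 code."""

    code = "IX_000_006"


@dataclass(frozen=True)
class PageRef:
    file_index: int  # position of the source file in the input list (assumed 0-based)
    page_no: int     # page within that file (assumed 1-based)


def plan_pdf_pages(page_counts: list[int]) -> list[PageRef]:
    """Expand per-PDF page counts into one PageRef per page, enforcing the cap."""
    refs: list[PageRef] = []
    for file_index, count in enumerate(page_counts):
        if count > MAX_PAGES_PER_PDF:
            raise PageLimitError(
                f"{PageLimitError.code}: file {file_index} has {count} pages"
            )
        refs.extend(PageRef(file_index, n) for n in range(1, count + 1))
    return refs
```

For example, `plan_pdf_pages([3, 2])` yields five references whose `file_index` values are 0, 0, 0, 1, 1, while a single 101-page entry raises before any page is planned.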

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
goldstein merged commit d801038c74 into main 2026-04-18 09:12:19 +00:00
Reference: goldstein/infoxtractor#11