feat(provenance): normalisers + short-value skip rule (spec §6) #7

Merged
goldstein merged 1 commit from feat/provenance-normalize into main 2026-04-18 08:56:46 +00:00

Pure functions that compose into the ReliabilityStep (task 1.8). Every rule is directly unit-testable.

  • normalize_string (NFKC + casefold + strip punctuation + collapse whitespace)
  • normalize_number (canonical -DDD.DD; CHF/de-DE/ASCII heuristics)
  • normalize_date (dateutil dayfirst=True → ISO)
  • normalize_iban (uppercase + strip whitespace)
  • should_skip_text_agreement (Literal/None/numeric-<10/short-string)
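
A minimal sketch of how three of the normalisers listed above could look, using only the standard library. This is an illustration under stated assumptions, not the merged code: the signatures are guessed, punctuation is assumed to be deleted (not replaced by a space), and a lone comma in a number is assumed to be a decimal comma. `normalize_date` is left out because it depends on dateutil.

```python
import re
import unicodedata
from decimal import Decimal

# Punctuation set quoted in the commit message: . , : ; ! ? ( ) [ ] { } / \ ' " `
_PUNCT_TABLE = str.maketrans("", "", ".,:;!?()[]{}/\\'\"`")


def normalize_string(value: str) -> str:
    """NFKC + casefold + drop punctuation + collapse whitespace."""
    text = unicodedata.normalize("NFKC", value).casefold()
    # Assumption: punctuation is deleted outright, not turned into a space.
    text = text.translate(_PUNCT_TABLE)
    # split()/join collapses whitespace runs and trims both ends.
    return " ".join(text.split())


def normalize_iban(value: str) -> str:
    """Uppercase + strip all whitespace; no format validation."""
    return "".join(value.split()).upper()


def normalize_number(value) -> str:
    """Return the canonical "[-]DDD.DD" form, always with two decimals."""
    if isinstance(value, (int, float, Decimal)):
        number = Decimal(str(value))
    else:
        # Keep digits, dot, comma and minus. This drops currency codes,
        # spaces and apostrophes (the Swiss thousands separator).
        text = re.sub(r"[^\d.,\-]", "", value)
        if "," in text and "." in text:
            # Both present: the rightmost one is the decimal separator.
            if text.rfind(",") > text.rfind("."):
                text = text.replace(".", "").replace(",", ".")  # de-DE
            else:
                text = text.replace(",", "")  # comma as thousands separator
        elif "," in text:
            # Assumption: a lone comma is a decimal comma ("12,5").
            text = text.replace(",", ".")
        number = Decimal(text)
    return str(number.quantize(Decimal("0.01")))
```

With these definitions, `normalize_number("CHF 1'234.56")` and `normalize_number("1.234,56")` both give `"1234.56"`, and `normalize_number("1234.5")` gives `"1234.50"`. The sketch does not resolve ambiguous inputs such as `"1.234"` (de-DE thousands or a plain decimal), which the real heuristics may treat differently.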

CI trigger is flaky for now; local tests green (77 passed, ruff clean).

goldstein added 1 commit 2026-04-18 08:56:40 +00:00
feat(provenance): normalisers + short-value skip rule (spec §6)
All checks were successful
tests / test (pull_request) Successful in 1m0s
tests / test (push) Successful in 1m28s
527fc620fe
Pure functions the ReliabilityStep will compose to compare extracted values
against OCR snippets (and context.texts). Kept in one module so every rule
is directly unit-testable without pulling in the step ABC.

Highlights:

- `normalize_string`: NFKC + casefold + strip common punctuation (. , : ; !
  ? () [] {} / \\ ' " `) + collapse whitespace. Substring-compatible.

- `normalize_number`: returns the canonical "[-]DDD.DD" form (always 2dp)
  after stripping currency symbols. Heuristic separator detection handles
  Swiss-German apostrophes ("1'234.56"), de-DE commas ("1.234,56"), and
  plain ASCII ("1234.56" / "1234.5" / "1234"). Accepts native int/float/
  Decimal as well as str.

- `normalize_date`: dateutil parse with dayfirst=True → ISO YYYY-MM-DD.
  Date and datetime objects short-circuit to their isoformat().

- `normalize_iban`: uppercase + strip whitespace. Format validation is the
  call site's job; this is pure canonicalisation.

- `should_skip_text_agreement`: dispatches on type + value. Literal → skip,
  None → skip, numeric |v|<10 → skip, len(str) ≤ 2 → skip. The numeric check
  runs first, so `10` (len("10")==2) is judged by the numeric rule (not
  skipped) instead of tripping the string-length rule.
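
The ordering of the skip rule can be sketched as follows. This is a hedged illustration, not the merged implementation: the `field_type` parameter is hypothetical, and "Literal" is assumed to mean a field annotated with `typing.Literal`.

```python
from decimal import Decimal
from typing import Any, Literal, get_origin


def should_skip_text_agreement(value: Any, field_type: Any = None) -> bool:
    """Return True when text agreement should not be checked for this value."""
    # Assumption: "Literal" refers to a typing.Literal[...] annotation.
    if get_origin(field_type) is Literal:
        return True
    if value is None:
        return True
    # The numeric rule runs before the string-length rule, so the int 10
    # is compared by magnitude and never reaches the len(str) check.
    if isinstance(value, (int, float, Decimal)):
        return abs(value) < 10
    return len(str(value)) <= 2
```

Here `should_skip_text_agreement(9)` is `True` and `should_skip_text_agreement(10)` is `False`, even though `len("10") == 2` would trip the string rule if it ran first.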

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
goldstein merged commit 2d22115893 into main 2026-04-18 08:56:46 +00:00
Reference: goldstein/infoxtractor#7