feat(provenance): normalisers + short-value skip rule (spec §6) #7
Pure functions that compose into the ReliabilityStep (task 1.8). Every rule is directly unit-testable.
- `normalize_string` (NFKC + casefold + strip punctuation + collapse whitespace)
- `normalize_number` (canonical `[-]DDD.DD`; CHF/de-DE/ASCII heuristics)
- `normalize_date` (dateutil `dayfirst=True` → ISO)
- `normalize_iban` (uppercase + strip whitespace)
- `should_skip_text_agreement` (Literal/None/numeric-<10/short-string)

CI trigger is flaky for now; local tests are green (77 passed, ruff clean).
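A minimal sketch of the two string canonicalisers listed above, written from the description only and not taken from the PR's diff. The punctuation set is copied from the commit message; whether the real code deletes punctuation or replaces it with a space is an assumption here (this sketch deletes it).

```python
import unicodedata

# Punctuation listed in the commit message; the module's actual set may differ.
_PUNCTUATION = ".,:;!?()[]{}/\\'\"`"
_STRIP_TABLE = str.maketrans("", "", _PUNCTUATION)


def normalize_string(value: str) -> str:
    """NFKC + casefold + strip punctuation + collapse whitespace."""
    text = unicodedata.normalize("NFKC", value).casefold()
    # Assumption: punctuation is removed outright rather than turned into spaces.
    text = text.translate(_STRIP_TABLE)
    # split() with no argument drops leading/trailing whitespace and collapses runs.
    return " ".join(text.split())


def normalize_iban(value: str) -> str:
    """Uppercase + strip all whitespace. No format validation is attempted."""
    return "".join(value.split()).upper()
```

Because both sides of a comparison go through the same function, a normalised extracted value can be checked against a normalised OCR snippet with a plain `in` test, which is presumably what "substring-compatible" refers to.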
Pure functions the ReliabilityStep will compose to compare extracted values against OCR snippets (and context.texts). Kept in one module so every rule is directly unit-testable without pulling in the step ABC.

Highlights:

- `normalize_string`: NFKC + casefold + strip common punctuation (. , : ; ! ? () [] {} / \\ ' " `) + collapse whitespace. Substring-compatible.
- `normalize_number`: returns the canonical "[-]DDD.DD" form (always 2dp) after stripping currency symbols. Heuristic separator detection handles Swiss-German apostrophes ("1'234.56"), de-DE commas ("1.234,56"), and plain ASCII ("1234.56" / "1234.5" / "1234"). Accepts native int/float/Decimal as well as str.
- `normalize_date`: dateutil parse with dayfirst=True → ISO YYYY-MM-DD. Date and datetime objects short-circuit to their isoformat().
- `normalize_iban`: uppercase + strip whitespace. Format validation is the call site's job; this is pure canonicalisation.
- `should_skip_text_agreement`: dispatches on type + value. Literal → skip, None → skip, numeric |v|<10 → skip, len(str) ≤ 2 → skip. Numeric check runs first so `10` (len("10")==2) is treated on the numeric side (not skipped) instead of tripping the string length rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
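The separator heuristic for `normalize_number` and the check ordering in `should_skip_text_agreement` described in the commit message could look roughly like the sketch below. It is written from the description, not from the diff. The `field_type` parameter, the "last separator wins" rule, the treatment of a lone comma as a decimal separator, and the default `Decimal` rounding mode are all assumptions.

```python
import re
from decimal import Decimal
from typing import Any, Literal, get_origin

_TWO_PLACES = Decimal("0.01")


def normalize_number(value: Any) -> str:
    """Return the canonical "[-]DDD.DD" form (always two decimal places)."""
    if isinstance(value, (int, float, Decimal)):
        number = Decimal(str(value))
    else:
        # Drop currency symbols/codes and spaces; keep digits, sign, separators.
        text = re.sub(r"[^0-9.,'-]", "", str(value))
        text = text.replace("'", "")  # Swiss-German thousands apostrophe
        if "," in text and "." in text:
            # Assumption: whichever separator appears last is the decimal mark.
            if text.rfind(",") > text.rfind("."):
                text = text.replace(".", "").replace(",", ".")  # de-DE
            else:
                text = text.replace(",", "")
        elif "," in text:
            # Assumption: a lone comma is a decimal comma.
            text = text.replace(",", ".")
        number = Decimal(text)
    # quantize() uses the context's default rounding mode here.
    return str(number.quantize(_TWO_PLACES))


def should_skip_text_agreement(value: Any, field_type: Any = None) -> bool:
    """Decide whether a value is too short or too generic to match against text."""
    # Hypothetical signature: the field's type annotation is passed alongside.
    if get_origin(field_type) is Literal:
        return True
    if value is None:
        return True
    # Numeric check runs before the string-length rule, so int 10 is not
    # skipped even though len("10") == 2.
    if isinstance(value, (int, float, Decimal)):
        return abs(value) < 10
    return len(str(value)) <= 2
```

Under these assumptions a prefix such as "Fr." would leave a stray dot in the input and fail to parse, and a de-DE thousands-only value like "1.234" would be read as a decimal, so the real heuristics are presumably more careful than this sketch.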