All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- CC statement support (#28, #29, #31–#34, PR #122): full credit card pipeline for paid tier, covering `ExtractionResult.card_number`, `BankTemplate.column_aliases`, and `CCGroupingService` grouping by last-4 card suffix, wired into `ServiceRegistry.group_by_card()`. `processor.run()` splits on `card_number is None` to route bank vs CC results.
- `aib_credit_card.json` template (#129, PR #138): correct CC column boundaries (`Transaction Date` 29–80, `Posting Date` 80–118, `Transaction Details` 118–370, `Amount` 370–430) so `RefContinuationClassifier` and `RowMergerService` handle CC two-line transaction splits correctly without falling back to bank columns.
- `PDFExtractorOptions` dataclass (#109, PR #139): groups 8 optional `PDFTableExtractor` constructor params into a single options object, reducing the constructor from 11 params to 3.
- Pylint design gate in CI (#85, PR #104): Xenon complexity gate and Pylint design checks added to CI pipeline.
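
The bank vs CC routing and last-4 grouping described in the CC statement support entry can be pictured roughly as follows. This is a minimal sketch, not the project's code: `CardResult`, `split_bank_and_cc`, and this `group_by_card` function are hypothetical stand-ins, and the masked card-number format is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CardResult:
    # Hypothetical, trimmed-down stand-in for an extraction result.
    source_file: str
    card_number: Optional[str] = None  # None is taken to mean a bank statement
    transactions: list = field(default_factory=list)


def split_bank_and_cc(results):
    """Route results: card_number is None -> bank, otherwise credit card."""
    bank = [r for r in results if r.card_number is None]
    cc = [r for r in results if r.card_number is not None]
    return bank, cc


def group_by_card(cc_results):
    """Group credit card results by the last four digits of the card number."""
    groups = defaultdict(list)
    for result in cc_results:
        digits = "".join(ch for ch in result.card_number if ch.isdigit())
        groups[digits[-4:]].append(result)
    return dict(groups)


results = [
    CardResult("bank_jan.pdf"),
    CardResult("cc_jan.pdf", card_number="**** **** **** 1234"),
    CardResult("cc_feb.pdf", card_number="**** **** **** 1234"),
    CardResult("cc_other.pdf", card_number="**** **** **** 9876"),
]
bank, cc = split_bank_and_cc(results)
groups = group_by_card(cc)
```

Here the two statements for the card ending in 1234 land in one group, and the bank statement never reaches the grouping step.
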
- #129: Non-transaction (empty/phantom) rows in CC CSV/JSON output eliminated. Root cause: missing CC template caused `Ref:` lines to be misclassified as transactions. Fixed by adding `aib_credit_card.json` (PR #138) and an earlier classifier fix (PR #133).
- #131: CC amounts ending in `CR` now populate the Credit column instead of Debit. `reroute_cr_suffix()` added to `currency.py`, wired via `RowPostProcessor._reroute_cr_amounts()` (PR #135).
- #132: CC transactions sorted incorrectly due to yearless dates. Year inferred from `Payment Due` date in statement; ordinal date suffixes added to `_PAYMENT_DUE_PATTERNS` (PR #133).
- #134: CC output dates now include the statement year (e.g. `4 Feb 2025`). `Transaction._enrich_date()` appends year via `to_dict()` (PR #137).
- #125: Unknown-IBAN group was producing output files instead of routing to `excluded_files.json` (PR #126).
- #123: Free-tier pipeline was producing CC grouped output files instead of routing to `excluded_files.json`. CC grouping now gated behind paid-tier entitlement check (PR #124).
- #106: Credit card PDFs were unconditionally skipped on the paid tier; now correctly processed (PR #108).
- #110: `data_retention_days` was not forwarded to `DataRetentionService` (PR #115).
- #78: `date_propagated` extraction warnings suppressed from JSON/CSV output (PR #93).
- #90: 214 logging f-string violations (G004) replaced with `%`-formatting (PR #100).
- #98: `_detect_text_based_table` decomposed to pass Xenon C complexity gate (PR #101).
- #80: Pre-existing unused imports (F401) removed from test files (PR #94).
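
To illustrate the #131 fix, the sketch below moves an amount with a trailing `CR` marker from the Debit column to the Credit column. It is an assumption-laden illustration: the real `reroute_cr_suffix()` signature is not shown in this changelog, so the function name `reroute_cr_suffix_sketch`, the dict-based row, and the `Debit`/`Credit` keys are all hypothetical.

```python
import re

# Matches an amount followed by a "CR" marker, e.g. "25.00CR" or "1,250.00 CR".
_CR_SUFFIX = re.compile(r"^\s*(?P<amount>[\d,]+(?:\.\d+)?)\s*CR\s*$", re.IGNORECASE)


def reroute_cr_suffix_sketch(row: dict) -> dict:
    """Return a copy of the row with a CR-suffixed Debit amount moved to Credit.

    The "Debit" and "Credit" keys are assumptions made for this sketch.
    """
    match = _CR_SUFFIX.match(row.get("Debit", "") or "")
    if match is None:
        return row  # no CR suffix, nothing to reroute
    rerouted = dict(row)
    rerouted["Credit"] = match.group("amount")
    rerouted["Debit"] = ""
    return rerouted


refund = reroute_cr_suffix_sketch({"Debit": "25.00CR", "Credit": ""})
purchase = reroute_cr_suffix_sketch({"Debit": "12.50", "Credit": ""})
```

Rows without the suffix pass through unchanged, so the step is safe to run on every row.
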
- Service layer migrated to `list[Transaction]` (#71, PR #79): all services accept/return `list[Transaction]`; dict round-trips removed. Output boundary conversion via `transactions_to_dicts(currency_symbol="")`.
- Currency-agnostic field names (#62–#64, #66, PR #67): `TransactionRow` fields renamed `_EUR → _AMT`; `strip_currency_symbols()` unified in `domain/currency.py`; `currency_symbol` defaults to `""` throughout.
- `ruff` replaces `flake8` (#84, #89, PR #91): ruff lint config in both `pyproject.toml` files; pre-commit hook updated to `astral-sh/ruff-pre-commit v0.8.0`.
- `pip-audit` replaces `safety` (#86, PR #97): dependency vulnerability scanning updated.
- Hadolint + pinned `trivy-action` added to CI (#87, PR #96).
- `[skip downstream]` support added to dispatch-downstream CI job (PR #82).
- #111, #112, #113: Dead fields removed from `ExtractionConfig` and `ExtractionScoringConfig`; dead `scoring_config` param removed from `PDFTableExtractor` (PRs #116, #117).
- CONTRIBUTING.md: Coverage threshold corrected to 91% to match `pyproject.toml` (PR #140).
- #47: `filter_service.apply_all_filters()` result was computed and logged but silently discarded. Filtered rows are now written back to `result.transactions` in `PDFProcessingOrchestrator.process_all_pdfs()`, so `filter_empty_rows`, `filter_header_rows`, and `filter_invalid_dates` are applied to every successfully extracted PDF.
- #52: `BankStatementProcessorBuilder.with_duplicate_strategy()` and `.with_date_sorting()` were inert: `build()` called `ServiceRegistry.from_config()` with no services, causing the registry to create its own defaults and silently ignore the configured strategy. The builder now constructs `DuplicateDetectionService` and `TransactionSortingService` from its configured values and passes them explicitly into `ServiceRegistry.from_config()`.
- #55: Credit card / no-IBAN PDFs excluded from the `pdfs_extracted` count in processing output. `process_all_pdfs()` now returns a 3-tuple `(results, pdf_count, pages_read)`.
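
The #47 bug is a common shape: a filtered list is computed and logged, but the caller keeps using the unfiltered data. The sketch below contrasts the two behaviours. `Result`, the toy `apply_all_filters`, and both `process_*` functions are hypothetical; the real filter rules are not described in this changelog.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class Result:
    # Hypothetical stand-in for a per-PDF extraction result.
    transactions: list = field(default_factory=list)


def apply_all_filters(rows):
    """Toy filter (assumed rule): drop rows with neither a date nor details."""
    return [r for r in rows if r.get("Date") or r.get("Details")]


def process_before_fix(result):
    filtered = apply_all_filters(result.transactions)
    logger.info("kept %d rows", len(filtered))  # logged, then discarded
    return result


def process_after_fix(result):
    filtered = apply_all_filters(result.transactions)
    logger.info("kept %d rows", len(filtered))
    result.transactions = filtered  # write the filtered rows back
    return result


rows = [{"Date": "4 Feb", "Details": "Shop"}, {"Date": "", "Details": ""}]
before = process_before_fix(Result(list(rows)))
after = process_after_fix(Result(list(rows)))
```
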
- #49: `ChronologicalSortingStrategy` sorts dicts directly via `DateParserService`, removing a redundant `Transaction` round-trip.
- #48: Deferred circular imports in `processor.py` removed; `service_registry`, `monthly_summary`, and `expense_analysis` import `ColumnAnalysisService`/`DateParserService` directly at module level.
- #50: `TransactionClassifier._looks_like_date` delegates to `RowAnalysisService.looks_like_date`, removing a duplicate regex and fixing a subtle 1-or-2-digit day matching bug.
- #51: `ProcessorFactory.create_from_config()` builds `ProcessorConfig` in one block via `BankStatementProcessorBuilder.with_processor_config()`; new config knobs now touch ≤2 files.
- Transaction enrichment (`source_page: int | None`, `confidence_score: float`, `extraction_warnings: list[str]`): all three fields default correctly and survive `to_dict`/`from_dict` round-trips (#16 / Phase 21).
- `ExtractionResult` dataclass (`domain/models/extraction_result.py`): typed extraction boundary with `transactions`, `page_count`, `iban`, `source_file`, and `warnings` fields. Architecture guard test enforces placement in `domain/models/` (#16 / Phase 22).
- End-to-end `ExtractionResult` pipeline: `PDFTableExtractor.extract()`, `ExtractionOrchestrator`, `PDFProcessingOrchestrator`, and `processor` all produce and consume `ExtractionResult`; zero tuple-index unpacking remains (#16 / Phase 23).
- `extraction/word_utils.py`: canonical module for `group_words_by_y`, `assign_words_to_columns` (with `strict_rightmost` flag), and `calculate_column_coverage`. Five callers migrated; four private duplicate methods deleted (#21 / Phase 24).
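
A typed extraction boundary of the kind described above might look like the sketch below. The field names come from the `ExtractionResult` entry, but the field types, the defaults, and the `summarise` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractionResult:
    """Sketch of a typed extraction boundary; types and defaults are assumed."""

    transactions: list = field(default_factory=list)
    page_count: int = 0
    iban: Optional[str] = None
    source_file: str = ""
    warnings: list = field(default_factory=list)


def summarise(result: ExtractionResult) -> str:
    # Named attribute access replaces positional access such as result[1].
    return (
        f"{result.source_file}: {len(result.transactions)} rows, "
        f"{result.page_count} pages"
    )


result = ExtractionResult(
    transactions=[{"Details": "Shop"}], page_count=2, source_file="jan.pdf"
)
summary = summarise(result)
```

Compared with a bare tuple, adding a field later does not shift any positions, which is the practical benefit of removing tuple-index unpacking.
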
- ServiceRegistry introduced (`feat/28`, PR #44): `ServiceRegistry.from_config(ProcessorConfig, Entitlements)` wires all transaction-processing services. `TransactionProcessingOrchestrator` deleted (PR #46 / issue #45).
- ClassifierRegistry with explicit integer priorities added to `row_classifiers.py` (`fix/29`, PR #39).
- `recursive_scan` default changed `False → True` in `ProcessingConfig`, `AppConfig`, `ProcessorBuilder`, and `PDFDiscoveryService`; `RECURSIVE_SCAN` env var added to `docker-compose.yml` (`fix/40`, PR #41).
- `ScoringConfig` injectable via `BankStatementProcessorBuilder.with_scoring_config()` (`feat/32`, PR #36).
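
A registry with explicit integer priorities, as in the ClassifierRegistry entry, can be sketched as below. Everything here is hypothetical: the class name `ClassifierRegistrySketch`, the callable-based classifiers, and the convention that a lower number runs first are assumptions, not the project's actual API.

```python
from typing import Callable, Optional

Classifier = Callable[[dict], Optional[str]]


class ClassifierRegistrySketch:
    """Runs classifiers in ascending priority order; first non-None label wins."""

    def __init__(self) -> None:
        self._entries = []  # list of (priority, name, classifier) tuples

    def register(self, priority: int, name: str, classifier: Classifier) -> None:
        self._entries.append((priority, name, classifier))
        # Keep entries ordered so registration order does not matter.
        self._entries.sort(key=lambda entry: entry[0])

    def classify(self, row: dict) -> str:
        for _, _, classifier in self._entries:
            label = classifier(row)
            if label is not None:
                return label
        return "unknown"


registry = ClassifierRegistrySketch()
# Registered out of order on purpose: priority 10 still runs before 20.
registry.register(
    20, "transaction", lambda row: "transaction" if row.get("Amount") else None
)
registry.register(
    10,
    "ref_continuation",
    lambda row: "continuation" if row.get("Details", "").startswith("Ref:") else None,
)
```

Making the priority explicit means a `Ref:` line that also carries an amount is classified as a continuation regardless of the order in which classifiers were registered.
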
- `extraction/word_utils.py` foundation work: `RowClassifier` chain injected as shared dependency (issue #17, PR #22).
- `PDFTableExtractor` decomposed into `PageHeaderAnalyser`, `RowBuilder`, and `RowPostProcessor` (issue #18, PR #23).
- Facade passthroughs deleted: `content_analysis_facade.py`, `validation_facade.py`, `row_classification_facade.py` removed; service→shim circular import chain broken (issue #20, Phase 20).
- `pdf_table_extractor.py` shim rewired to module-level singletons; `pdf_extractor.py` cleaned of four lazy facade imports.
- Architecture guard test `test_facade_modules_deleted` added.
- Credit card templates (`aib_credit_card.json`, `credit_card_default.json`) removed from open-source repo; credit card support is PAID tier only via `require_iban=False` in `Entitlements.paid_tier()`.