Add parsers for YES BANK, Suryoday SFB, and Kotak Mahindra Bank #8
anshuman852 wants to merge 6 commits into …
Add a bank-specific parser and detection path for YES BANK statements to prevent summary/header rows from being parsed as transactions and to correctly classify Dr/Cr entries.
Add SSFB-specific detection and parsing for statement summary fields while isolating MITC/example-page content so reconciliation remains stable.
… tests
Correctness:
- Rename shadowed `words` variable to `name_parts` in name extraction
- Tighten is_credit check from {"CR", "C"} to == "CR" to avoid
false positives from single-letter tokens
- Use \s+ instead of \n in SSFB summary regex for PyMuPDF version
robustness
- Add comments explaining intentional (Dr|Cr) group ignoring in
Purchases and Payments summary regexes
- Add comment flagging SSFB hardcoded empty transactions list
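
Two of the correctness items above can be illustrated with a minimal sketch. The label `Total Amount Due`, the sample strings, and the function names are assumptions for illustration only; the real SSFB regex and the real `is_credit` code are not shown in this PR text.

```python
import re

# Hypothetical summary label; the actual SSFB field names are not shown here.
STRICT = re.compile(r"Total Amount Due\n([\d,]+\.\d{2})")
TOLERANT = re.compile(r"Total Amount Due\s+([\d,]+\.\d{2})")


def extract(pattern: re.Pattern, text: str):
    """Return the captured amount, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# The same field, as two different text extractions might lay it out.
newline_layout = "Total Amount Due\n1,234.50"
space_layout = "Total Amount Due 1,234.50"


def is_credit_loose(token: str) -> bool:
    # Old check: a stray single-letter "C" token also counts as a credit marker.
    return token in {"CR", "C"}


def is_credit_strict(token: str) -> bool:
    # Tightened check: only the full "CR" marker counts.
    return token == "CR"
```

With `\s+`, the pattern matches whether the extractor separates label and value with a newline or with spaces; the `\n` version matches only the first layout.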
Style/maintainability:
- Move _EXCLUDED_WORDS to module scope (consistent with _MERCHANT_CATEGORIES)
- Extract _format_card_number() helper to deduplicate card number
formatting (was inlined 3 times)
- Remove unused detected_members variable and debug dict entry
Tests:
- Add BankOption and factory detection tests for ssfb and yesbank
- Add parser-contract smoke tests for both parsers
- Fix pre-existing test_browser.py bank count (12 -> 14)
- Remove unused parse_amount import
- Remove unused multi_letter_words variable (populated but never read)
- Fix email camel-case splitter: run ([a-z])([A-Z]) regex before .upper() so it can actually match mixed-case email local parts
- Fix _MERCHANT_CATEGORIES comment to reflect that categories are stripped from narration, not included
- Reapply merchant category and Dr/Cr stripping after context merge rebuilds narration from raw tokens (prevents re-introduction)
- Remove unused current_member variable and its references
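
The ordering issue in the email camel-case splitter fix can be shown in a few lines. The function names below are hypothetical; only the `([a-z])([A-Z])` pattern and the `.upper()` call come from the commit message.

```python
import re

_CAMEL_BOUNDARY = re.compile(r"([a-z])([A-Z])")


def upper_then_split(local_part: str) -> str:
    # Buggy order: after .upper() no lowercase letters remain,
    # so the pattern can never match and nothing is split.
    return _CAMEL_BOUNDARY.sub(r"\1 \2", local_part.upper())


def split_then_upper(local_part: str) -> str:
    # Fixed order: split on lower-to-upper boundaries first, then upper-case.
    return _CAMEL_BOUNDARY.sub(r"\1 \2", local_part).upper()
```

For the local part `johnDoe`, the first version yields `JOHNDOE` and the second yields `JOHN DOE`.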
Pull request overview
Adds Kotak Mahindra Bank support to the credit-card statement parsing system by wiring a new parser into the registry, detection factory, and CLI/browser surfaces, along with contract tests.
Changes:
- Introduces a new `KotakParser` implementation and registers it in the parser registry.
- Extends bank auto-detection rules and CLI enum to include the new `kotak` slug.
- Updates contract/browser tests to cover Kotak availability, detection, and minimal parsing.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cc_parser/parsers/kotak.py | Adds the Kotak statement parsing implementation (name/due date/TAD + transactions). |
| cc_parser/parsers/registry.py | Registers kotak parser in the stable registry ordering. |
| cc_parser/parsers/factory.py | Adds a KOTAK detection rule and updates detect_bank docstring. |
| cc_parser/cli.py | Exposes kotak via BankOption and updates parse_statement docstring. |
| tests/test_contracts.py | Adds contract coverage for Kotak exposure, detection, and smoke parsing. |
| tests/test_browser.py | Updates browser bank list test to include kotak and new count. |
| AGENTS.md | Documents the new Kotak parser profile entry. |
    self.assertEqual(result.file, "test.pdf")
    self.assertEqual(result.name, "JOHN DOE")
    self.assertEqual(result.due_date, "08/05/2026")
    self.assertEqual(result.statement_total_amount_due, "4186.50")
The test expects statement_total_amount_due to be "4186.50" even though the input text contains a comma ("4,186.50"). The Kotak parser currently uses normalize_amount(), which does not remove thousands separators, so it will return "4,186.50" and this assertion will fail. Align the expectation with existing conventions (e.g., keep commas like the Equitas test) or change the parser to explicitly strip commas for this field (and ensure it’s consistent across parsers).
Suggested change:

    - self.assertEqual(result.statement_total_amount_due, "4186.50")
    + self.assertEqual(result.statement_total_amount_due, "4,186.50")
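
If the second option from the comment is preferred (stripping commas in the parser rather than changing the test), a minimal sketch might look like the following. `strip_thousands_separators` is a hypothetical helper, not part of `cc_parser.parsers.tokens`, and how it would combine with `normalize_amount()` is left open.

```python
from decimal import Decimal


def strip_thousands_separators(amount: str) -> str:
    """Hypothetical helper: '4,186.50' -> '4186.50'.

    Raises decimal.InvalidOperation if the cleaned text is not numeric.
    """
    cleaned = amount.replace(",", "").strip()
    Decimal(cleaned)  # validation only; the string form is what gets returned
    return cleaned
```

Because it removes every comma, the same helper also handles Indian-style grouping such as `1,23,456.78`. Whichever option is chosen, the review's point about keeping the convention consistent across parsers still applies.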
    def _extract_name(first_page_text: str) -> str | None:
        """Extract cardholder name near the top of page 1."""
        # Format: "Ansuman Mishra Monthly statement for your League Credit Card X3188"
        match = re.search(
            r"^([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+Monthly\s+statement",
            first_page_text,
            re.MULTILINE,
        )
        if not match:
            return None
        candidate = " ".join(match.group(1).split()).strip().upper()
        parts = candidate.split()
        if 2 <= len(parts) <= 6:
            return candidate
        return None
_extract_name() only matches Title Case names ([A-Z][a-z]+ ...) but the new Kotak contract test input uses an ALL-CAPS name ("JOHN DOE"), so name will be None and the test/parsed output will not match. Update the regex to also accept all-caps (and/or mixed-case) names before "Monthly statement", while still guarding against overly long/short matches.
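
One possible way to address this, sketched under assumptions: each name part starts with an uppercase letter and may continue in either case, which accepts both `Ansuman Mishra` and `JOHN DOE`. The function and constant names are illustrative, and matching only spaces or tabs between name parts (rather than `\s+`) is a choice made here to keep the name on a single line, not something the PR specifies.

```python
import re
from typing import Optional

# Title Case, ALL CAPS, or mixed-case parts, all on the same line as the header.
_NAME_BEFORE_HEADER = re.compile(
    r"^([A-Z][A-Za-z]*(?:[ \t]+[A-Z][A-Za-z]*)*)[ \t]+Monthly\s+statement",
    re.MULTILINE,
)


def extract_name(first_page_text: str) -> Optional[str]:
    """Return the upper-cased cardholder name, or None if no plausible match."""
    match = _NAME_BEFORE_HEADER.search(first_page_text)
    if not match:
        return None
    candidate = " ".join(match.group(1).split()).upper()
    parts = candidate.split()
    # Same length guard as the original: reject one-word or very long matches.
    return candidate if 2 <= len(parts) <= 6 else None
```

The 2-to-6 word guard from the original is kept, so a single capitalised word before "Monthly statement" is still rejected.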
    from cc_parser.parsers.tokens import (
        format_amount,
        normalize_amount,
        parse_amount,
        parse_amount_token,
        sum_amounts,
        sum_points,
    )
parse_amount is imported from cc_parser.parsers.tokens but never used in this module. Please remove the unused import to avoid lint noise and keep imports minimal.