Add parsers for YES BANK, Suryoday SFB, and Kotak Mahindra Bank #8

Open
anshuman852 wants to merge 6 commits into akhilnarang:master from anshuman852:master

Conversation

@anshuman852

Contributor

No description provided.

Add a bank-specific parser and detection path for YES BANK statements to prevent summary/header rows from being parsed as transactions and to correctly classify Dr/Cr entries.
Add SSFB-specific detection and parsing for statement summary fields while isolating MITC/example-page content so reconciliation remains stable.
… tests

Correctness:
- Rename shadowed `words` variable to `name_parts` in name extraction
- Tighten is_credit check from {"CR", "C"} to == "CR" to avoid
  false positives from single-letter tokens
- Use \s+ instead of \n in SSFB summary regex for PyMuPDF version
  robustness
- Add comments explaining intentional (Dr|Cr) group ignoring in
  Purchases and Payments summary regexes
- Add comment flagging SSFB hardcoded empty transactions list

Style/maintainability:
- Move _EXCLUDED_WORDS to module scope (consistent with _MERCHANT_CATEGORIES)
- Extract _format_card_number() helper to deduplicate card number
  formatting (was inlined 3 times)
- Remove unused detected_members variable and debug dict entry

Tests:
- Add BankOption and factory detection tests for ssfb and yesbank
- Add parser-contract smoke tests for both parsers
- Fix pre-existing test_browser.py bank count (12 -> 14)
- Remove unused parse_amount import
- Remove unused multi_letter_words variable (populated but never read)
- Fix email camel-case splitter: run ([a-z])([A-Z]) regex before .upper()
  so it can actually match mixed-case email local parts
- Fix _MERCHANT_CATEGORIES comment to reflect that categories are
  stripped from narration, not included
- Reapply merchant category and Dr/Cr stripping after context merge
  rebuilds narration from raw tokens (prevents re-introduction)
- Remove unused current_member variable and its references
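The camel-case splitter fix listed above is an ordering problem: once the text is upper-cased, `([a-z])([A-Z])` has no lowercase letter left to match. A minimal sketch of the two orderings follows; the function names are hypothetical and only the regex is taken from the commit message.

```python
import re

# Boundary between a lowercase letter and the uppercase letter that follows it.
_CAMEL_BOUNDARY = re.compile(r"([a-z])([A-Z])")


def upper_then_split(local_part: str) -> str:
    # Buggy order: after .upper() there are no lowercase letters, so nothing is split.
    return _CAMEL_BOUNDARY.sub(r"\1 \2", local_part.upper())


def split_then_upper(local_part: str) -> str:
    # Fixed order: insert the space at the case boundary first, then upper-case.
    return _CAMEL_BOUNDARY.sub(r"\1 \2", local_part).upper()
```

For an input such as `"johnDoe"`, `upper_then_split` returns `"JOHNDOE"` while `split_then_upper` returns `"JOHN DOE"`.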
Copilot AI review requested due to automatic review settings April 22, 2026 13:00

Copilot AI left a comment

Pull request overview

Adds Kotak Mahindra Bank support to the credit-card statement parsing system by wiring a new parser into the registry, detection factory, and CLI/browser surfaces, along with contract tests.

Changes:

  • Introduces a new KotakParser implementation and registers it in the parser registry.
  • Extends bank auto-detection rules and CLI enum to include the new kotak slug.
  • Updates contract/browser tests to cover Kotak availability, detection, and minimal parsing.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| cc_parser/parsers/kotak.py | Adds the Kotak statement parsing implementation (name/due date/TAD + transactions). |
| cc_parser/parsers/registry.py | Registers kotak parser in the stable registry ordering. |
| cc_parser/parsers/factory.py | Adds a KOTAK detection rule and updates detect_bank docstring. |
| cc_parser/cli.py | Exposes kotak via BankOption and updates parse_statement docstring. |
| tests/test_contracts.py | Adds contract coverage for Kotak exposure, detection, and smoke parsing. |
| tests/test_browser.py | Updates browser bank list test to include kotak and new count. |
| AGENTS.md | Documents the new Kotak parser profile entry. |

Comment thread tests/test_contracts.py
self.assertEqual(result.file, "test.pdf")
self.assertEqual(result.name, "JOHN DOE")
self.assertEqual(result.due_date, "08/05/2026")
self.assertEqual(result.statement_total_amount_due, "4186.50")

Copilot AI Apr 22, 2026

The test expects statement_total_amount_due to be "4186.50" even though the input text contains a comma ("4,186.50"). The Kotak parser currently uses normalize_amount(), which does not remove thousands separators, so it will return "4,186.50" and this assertion will fail. Align the expectation with existing conventions (e.g., keep commas like the Equitas test) or change the parser to explicitly strip commas for this field (and ensure it’s consistent across parsers).

Suggested change
- self.assertEqual(result.statement_total_amount_due, "4186.50")
+ self.assertEqual(result.statement_total_amount_due, "4,186.50")
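If the other option from this comment is preferred (stripping commas in the parser rather than changing the expectation), a minimal sketch might look like the following. `strip_thousands_separators` is a hypothetical helper name used for illustration; how it would fit alongside `normalize_amount()` is an assumption, since that function's body is not shown here.

```python
import re

# A plain decimal number with no grouping separators, e.g. "4186.50".
_PLAIN_AMOUNT = re.compile(r"\d+(?:\.\d+)?")


def strip_thousands_separators(amount_text: str) -> str:
    """Return the amount with grouping commas removed, e.g. "4,186.50" -> "4186.50"."""
    cleaned = amount_text.strip().replace(",", "")
    # Reject anything that is not a plain number once the commas are gone.
    if not _PLAIN_AMOUNT.fullmatch(cleaned):
        raise ValueError(f"not a numeric amount: {amount_text!r}")
    return cleaned
```

Because it removes every comma, the same helper also handles Indian-style grouping such as `"1,23,456.78"`. Whichever convention is chosen, it should apply to this field across all parsers, as the comment notes.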

Comment on lines +65 to +79
def _extract_name(first_page_text: str) -> str | None:
    """Extract cardholder name near the top of page 1."""
    # Format: "Ansuman Mishra Monthly statement for your League Credit Card X3188"
    match = re.search(
        r"^([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+Monthly\s+statement",
        first_page_text,
        re.MULTILINE,
    )
    if not match:
        return None
    candidate = " ".join(match.group(1).split()).strip().upper()
    parts = candidate.split()
    if 2 <= len(parts) <= 6:
        return candidate
    return None

Copilot AI Apr 22, 2026

_extract_name() only matches Title Case names ([A-Z][a-z]+ ...) but the new Kotak contract test input uses an ALL-CAPS name ("JOHN DOE"), so name will be None and the test/parsed output will not match. Update the regex to also accept all-caps (and/or mixed-case) names before "Monthly statement", while still guarding against overly long/short matches.

Comment on lines +26 to +33
from cc_parser.parsers.tokens import (
    format_amount,
    normalize_amount,
    parse_amount,
    parse_amount_token,
    sum_amounts,
    sum_points,
)

Copilot AI Apr 22, 2026

parse_amount is imported from cc_parser.parsers.tokens but never used in this module. Please remove the unused import to avoid lint noise and keep imports minimal.
