Skip to content

Go Parser Validator Audit#5834

Merged
elipe17 merged 18 commits into
developfrom
5728-go-validator-audit
May 21, 2026
Merged

Go Parser Validator Audit#5834
elipe17 merged 18 commits into
developfrom
5728-go-validator-audit

Conversation

@elipe17
Copy link
Copy Markdown

@elipe17 elipe17 commented May 6, 2026

Summary of Changes

  • Aligned remaining yaml schemas and file specs with python
  • Added go equivalent integrations tests for most of the remaining python parser integration tests
  • Added to and updated validators in validators.yaml where needed
  • Updated the grouping KeyFields config to allow for completely generic key field specification
  • Updated Decoder.Row to know how to extract it's key fields based on the config
  • Update tabular decoders to mirror python logic for reading the first row of an FRA datafile for a "header" check
  • Added new sentinel error for decoder unknown parser errors
  • Updated the sortable interface to leverage Decoder.Row interface instead of the thin adapter we had before (KeyFieldExtractor)
  • Updated accumulator/pipeline/orchestrators to adhere to new Decoder.Row interface
  • Fixed bug around rollbacks and parsing cancellation that was preventing sentinel errors from being written
  • Add FRA sections for celery server with a TODO caveat
  • Updated validation key to be schema path instead of record type since TANF and Tribal would have conflicted
  • Fixed bug where schemas with multiple segments could duplicate field validators across segment fields
  • Left some other TODOs that I will turn into follow on tickets
    Pull request closes Go Parser: Audit validators against Python parser for completeness #5728

How to Test

cd tdrs-frontend && docker-compose up --build
cd tdrs-backend && docker-compose up --build
  1. Submit pretty much any of the test files to the go parser and audit the results
  2. Run the tests
  3. Try and identify other gaps that I have missed that can be turned into follow on tickets

Deliverables

More details on how deliverables herein are assessed included here.

Deliverable 1: Accepted Features

Checklist of ACs:

  • Every preparsing_validator in Python schema_defs has a corresponding Go validator
  • Every field validator in Python schema_defs has a corresponding Go validator
  • Every postparsing_validator in Python schema_defs has a corresponding Go validator
  • Checklist produced documenting any gaps found
  • All identified gaps are either resolved or tracked as separate issues
  • Testing Checklist has been run and all tests pass
  • README is updated, if necessary

Deliverable 2: Tested Code

  • Are all areas of code introduced in this PR meaningfully tested?
    • If this PR introduces backend code changes, are they meaningfully tested?
    • If this PR introduces frontend code changes, are they meaningfully tested?
  • Are code coverage minimums met?
    • Frontend coverage: [insert coverage %] (see CodeCov Report comment in PR)
    • Backend coverage: [insert coverage %] (see CodeCov Report comment in PR)

Deliverable 3: Properly Styled Code

  • Are backend code style checks passing on CircleCI?
  • Are frontend code style checks passing on CircleCI?
  • Are code maintainability principles being followed?

Deliverable 4: Accessible

  • Does this PR complete the epic?
  • Are links included to any other gov-approved PRs associated with epic?
  • Does PR include documentation for Raft's a11y review?
  • Did automated and manual testing with iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?

Deliverable 5: Deployed

  • Was the code successfully deployed via automated CircleCI process to development on Cloud.gov?

Deliverable 6: Documented

  • Does this PR provide background for why coding decisions were made?
  • If this PR introduces backend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces frontend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces dependencies, are their licenses documented?
  • Can reviewer explain and take ownership of these elements presented in this code review?

Deliverable 7: Secure

  • Does the OWASP Scan pass on CircleCI?
  • Do manual code review and manual testing detect any new security issues?
  • If new issues detected, is investigation and/or remediation plan documented?

Deliverable 8: User Research

Research product(s) clearly articulate(s):

  • the purpose of the research
  • methods used to conduct the research
  • who participated in the research
  • what was tested and how
  • impact of research on TDP
  • (if applicable) final design mockups produced for TDP development

elipe17 added 9 commits May 5, 2026 09:45
- Update FRA and Tribal schemas with validators
- Add tests surrounding FRA and Tribal
- Get rid of SerializeHeaderError in favor of generic SerializeParserError
- Remove extra field on m3
- fix tanf t7 schema
- Update validator registry to not duplicate validators accross the same segment fields
- Make required field validation easier to read
…n calendar quarter

- Add record calendar quarter validator
- Fix mis alignments in tribal validators
- Update incorrect python test and added comment justification
… fields

- Add interface to decoder row to allow the row to dictate how to extract it's key fields for grouping and sorting
- Update sortable to accept key fields instead of an extractor
- Update accumulator to use new row interface
- Clean out old static key fields config throughout
…ed record counts by running python version and updated python record counts too.

- Add new sentinel error for decoder unknown
- Add unit tests for server cover decoder unknown path
@elipe17 elipe17 self-assigned this May 6, 2026
@elipe17 elipe17 added backend dev raft review This issue is ready for raft review labels May 6, 2026
elipe17 added 8 commits May 5, 2026 21:59
- Update file specs to use the correct cat4 validators
- Add duplicate detection to s3 and s4 files
- Rewrite requires_related_record & requires_related_record_with_int_value to execute per record
- Remove unused compiled functions
- remove extra field from tests
- Disable keycloak sync for all other tests
@codecov
Copy link
Copy Markdown

codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.01%. Comparing base (4c3f482) to head (23cf207).
⚠️ Report is 34 commits behind head on develop.

Files with missing lines Patch % Lines
.../tdpservice/parsers/test/test_parse_large_files.py 0.00% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #5834      +/-   ##
===========================================
+ Coverage    94.00%   94.01%   +0.01%     
===========================================
  Files          538      538              
  Lines        24650    24819     +169     
  Branches       620      620              
===========================================
+ Hits         23171    23334     +163     
- Misses        1366     1372       +6     
  Partials       113      113              
Flag Coverage Δ
dev-backend 94.30% <81.25%> (+0.01%) ⬆️
dev-frontend 91.84% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
tdrs-backend/tdpservice/conftest.py 94.47% <100.00%> (+0.08%) ⬆️
tdrs-backend/tdpservice/core/test/test_admin.py 100.00% <ø> (ø)
tdrs-backend/tdpservice/parsers/test/conftest.py 97.72% <ø> (ø)
tdrs-backend/tdpservice/parsers/test/test_parse.py 99.78% <100.00%> (-0.01%) ⬇️
...ackend/tdpservice/users/test/test_keycloak_sync.py 100.00% <100.00%> (ø)
.../tdpservice/parsers/test/test_parse_large_files.py 80.00% <0.00%> (ø)

... and 4 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4740d5c...23cf207. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Base automatically changed from 5727-go-parser-rollback-handling to develop May 13, 2026 04:47
Copy link
Copy Markdown

@jtimpe jtimpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested most of the submissions by uploading them through the frontend to the python parser, then running them through the go parser in dry-run mode and comparing the outputs. Only noticed a few small inconsistencies

expected_t3_record_count = 172525
expected_t1_record_count = 96497
expected_t2_record_count = 112622
expected_t3_record_count = 172552
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did the record counts change?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the counts the python parser reported when I ran it. I don't remember the last time we actaully enabled this test. My best guess is that we introduced new record precheck or case consistency checks that changed the number.

params:
record_type: T6

validation_orchestrator:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be defined for every section file? they should be the same across any of the submissions, no?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it does right now. We could setup a default definition for it to keep it DRYer. The reason I didn't do that initially is because we don't have a schema def/document for the complete shape of our yaml files yet. So everything we default and leave out is obscurred.

# Education level validator - 0-16 or 98-99
- id: education_level
expr: "(int(Value) >= 0 and int(Value) <= 16) or (int(Value) >= 98 and int(Value) <= 99)"
expr: "(int(trim(string(Value))) >= 0 and int(trim(string(Value))) <= 16) or (int(trim(string(Value))) >= 98 and int(trim(string(Value))) <= 99)"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe defining field type: integer could do some sort of pre-validation transformation so that int(trim(string(Value))) doesn't have to be done so many times?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I really want to update all the schemas to treat actual integer values as integers at parse time instead of a string and then converting. I think this would save us a bunch of bandwidth, but it requires major migrations.

@@ -73,12 +73,13 @@ accumulator:
# These byte positions are used BEFORE full parsing to quickly
# determine which case a record belongs to.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TANF section 2 files seem to have a bit of drift in the number of parser errors generated (638 for celery parser, 582 for go parser using the file ADS.E2J.FTP2.TS06)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will dig in and see what the extra errors are that the python parser is generating.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtimpe I have found the reason why we have different number of errors between the two. The difference between the two sets came down to 2 different validators. The first is a difference between pythons category3.isOlderThan(18) and the go parser equivalent age calculation. The go parser actually uses the correct age calculation (the one used in the AGE_FIRST view calculation cc @mattcoleanderson ). Because of this, around half of the extra errors python generates that go doesn't is because they are actually valid from go's perspective/don't actually trigger the category 3 validation.

The second issue is that python generates the deprecated category four errors still. Go doesn't even implement them since they are going away.


# Tribal TANF allows WORK_PART_STATUS 01-03, 05-09, 11-19, or 99.
- id: tribal_t2_family_affil_1_2_work_part_status
expr: "if GetInt('FAMILY_AFFILIATION') in [1, 2] { (GetInt('WORK_PART_STATUS') in 1..3) or (GetInt('WORK_PART_STATUS') in 5..9) or (GetInt('WORK_PART_STATUS') in 11..19) or GetString('WORK_PART_STATUS') == '99' } else { true }"
Copy link
Copy Markdown

@jtimpe jtimpe May 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two things about this validator, i assume this applies to any validators of this style

  • the field_name and item_number don't seem to be reported (at least in the dry run, the column comes back <nil>)
  • the python parser reports the given value for FAMILY_AFFILIATION (Since Item 26 (Family Affiliation) is 1...)

@jtimpe
Copy link
Copy Markdown

jtimpe commented May 15, 2026

I tested most of the submissions by uploading them through the frontend to the python parser, then running them through the go parser in dry-run mode and comparing the outputs. Only noticed a few small inconsistencies

apologies - please disregard my review comments for the time being. i need to re-verify everything as i believe i tested the wrong branch 🤦

@jtimpe
Copy link
Copy Markdown

jtimpe commented May 15, 2026

@elipe17 i tested again and deleted the comments that were resolved. my only remaining blocker is the tanf s2 error count drift. otherwise looks great!

@elipe17 elipe17 requested a review from jtimpe May 20, 2026 16:37
@elipe17 elipe17 merged commit abc9541 into develop May 21, 2026
14 checks passed
@elipe17 elipe17 deleted the 5728-go-validator-audit branch May 21, 2026 16:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend dev raft review This issue is ready for raft review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Go Parser: Audit validators against Python parser for completeness

3 participants