5566 Document parsing/reparsing workflow #5821

Open
raftmsohani wants to merge 7 commits into develop from 5566-document-current-parsing-reparsing-flows

@raftmsohani raftmsohani commented May 1, 2026

Summary of Changes

Provide a brief summary of changes
Pull request closes #5566

How to Test

List the steps to test the PR
These steps are generic, please adjust as necessary.

cd tdrs-frontend && docker-compose up --build
cd tdrs-backend && docker-compose up --build
  1. Open http://localhost:3000/ and sign in.
  2. Proceed with functional tests as described herein.
  3. Test steps should be captured in the demo GIF(s) and/or screenshots below.

Demo GIF(s) and screenshots for testing procedure

Deliverables

More details on how the deliverables herein are assessed are included here.

Deliverable 1: Accepted Features

Checklist of ACs:

  • [insert ACs here]
  • lfrohlich and/or adpennington confirmed that ACs are met.

Deliverable 2: Tested Code

  • Are all areas of code introduced in this PR meaningfully tested?
    • If this PR introduces backend code changes, are they meaningfully tested?
    • If this PR introduces frontend code changes, are they meaningfully tested?
  • Are code coverage minimums met?
    • Frontend coverage: [insert coverage %] (see CodeCov Report comment in PR)
    • Backend coverage: [insert coverage %] (see CodeCov Report comment in PR)

Deliverable 3: Properly Styled Code

  • Are backend code style checks passing on CircleCI?
  • Are frontend code style checks passing on CircleCI?
  • Are code maintainability principles being followed?

Deliverable 4: Accessible

  • Does this PR complete the epic?
  • Are links included to any other gov-approved PRs associated with epic?
  • Does PR include documentation for Raft's a11y review?
  • Did automated and manual testing with iamjolly and ttran-hub using Accessibility Insights reveal any errors introduced in this PR?

Deliverable 5: Deployed

  • Was the code successfully deployed via automated CircleCI process to development on Cloud.gov?

Deliverable 6: Documented

  • Does this PR provide background for why coding decisions were made?
  • If this PR introduces backend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces frontend code, is that code easy to understand and sufficiently documented, both inline and overall?
  • If this PR introduces dependencies, are their licenses documented?
  • Can reviewer explain and take ownership of these elements presented in this code review?

Deliverable 7: Secure

  • Does the OWASP Scan pass on CircleCI?
  • Do manual code review and manual testing detect any new security issues?
  • If new issues detected, is investigation and/or remediation plan documented?

Deliverable 8: User Research

Research product(s) clearly articulate(s):

  • the purpose of the research
  • methods used to conduct the research
  • who participated in the research
  • what was tested and how
  • impact of research on TDP
  • (if applicable) final design mockups produced for TDP development

@raftmsohani raftmsohani self-assigned this May 1, 2026
@raftmsohani raftmsohani linked an issue May 1, 2026 that may be closed by this pull request
12 tasks

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.98%. Comparing base (12a2679) to head (d91ef10).
⚠️ Report is 24 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #5821   +/-   ##
========================================
  Coverage    93.98%   93.98%           
========================================
  Files          536      536           
  Lines        24527    24527           
  Branches       620      620           
========================================
  Hits         23051    23051           
  Misses        1363     1363           
  Partials       113      113           

| Flag | Coverage Δ |
| --- | --- |
| dev-backend | 94.26% <ø> (ø) |
| dev-frontend | 91.84% <ø> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 17c8db8...d91ef10. Read the comment docs.

@jtimpe jtimpe left a comment

I only did a high-level pass here and didn't verify all the details. That said, a couple of points of feedback:

  • I don't find the charts particularly useful, personally. I'm not quite sure how to read them, and there's a lot of complexity to parse through. The text flows make more sense to me.
  • There are a lot of implementation details represented that I feel are unnecessary. Class names and files might be unavoidable, but references to kwargs, try/except, function/method names, etc. seem too in-the-weeds.

The way it is currently written is about as cognitively demanding as reading the code itself. Plus, as we make changes to these implementation details, we have to meticulously update the documentation alongside them or it will go out of date. I'd prefer the documentation to cover the structure, behavior rules, and the "why" behind the implementation, rather than the implementation itself in a lot of detail. Perhaps including some examples of how certain structures or validators get used could be helpful.

Open to conversation and opinions on this, those are just my initial thoughts.

│ └── DataFile.create_new_version(...) creates the DataFile with file=None (state = UPLOADED)
├── transition_datafile → VIRUS_SCAN_STARTED
├── ClamAVClient.scan_file(...) ← synchronous, in-request scan

Doesn't ClamAV scanning happen before DataFileSerializer.create()? Or does create() call DataFile.create_new_version, which calls the scan, with a scan failure blocking DataFile creation? I'm not sure how that should be represented here in terms of calls, but both this and the diagram above indicate (to me, anyway) that the scan runs in parallel with, or after, model creation.

Author

We have changed that behavior to ensure the state can be stored. We first create the DataFile without storing any file. ClamAV then scans the upload during the request. If the scan fails, the DataFile remains in a failed-scan state for lifecycle visibility, but the uploaded file is not stored and no parse task is queued.
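
For readers following this thread, the ordering described in the reply above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `DataFileRecord`, `handle_upload`, the `scan` callable, and the `VIRUS_SCAN_FAILED` / `VIRUS_SCAN_PASSED` state names are all hypothetical stand-ins. Only `UPLOADED` and `VIRUS_SCAN_STARTED` appear in the excerpt under review.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class State(Enum):
    UPLOADED = "uploaded"
    VIRUS_SCAN_STARTED = "virus_scan_started"
    VIRUS_SCAN_FAILED = "virus_scan_failed"  # hypothetical state name
    VIRUS_SCAN_PASSED = "virus_scan_passed"  # hypothetical state name


@dataclass
class DataFileRecord:
    """Hypothetical stand-in for the DataFile model (not the real Django model)."""

    state: State = State.UPLOADED
    stored_file: Optional[bytes] = None  # created with no file attached
    history: List[State] = field(default_factory=lambda: [State.UPLOADED])

    def transition(self, new_state: State) -> None:
        # Record every state change so the lifecycle stays visible.
        self.state = new_state
        self.history.append(new_state)


def handle_upload(
    payload: bytes,
    scan: Callable[[bytes], bool],
    parse_queue: list,
) -> DataFileRecord:
    """Create the record first, scan in-request, then store and queue only if clean."""
    record = DataFileRecord()
    record.transition(State.VIRUS_SCAN_STARTED)

    if not scan(payload):
        # Scan failed: keep the record for visibility, store nothing, queue nothing.
        record.transition(State.VIRUS_SCAN_FAILED)
        return record

    record.stored_file = payload
    record.transition(State.VIRUS_SCAN_PASSED)
    parse_queue.append(record)
    return record
```

The point of the sketch is that the record exists before the scan runs, so a failed scan still leaves a visible lifecycle trail even though no file is stored.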

Comment thread docs/Technical-Documentation/parsing-reparsing-architecture.md Outdated
```
ParserFactory.get_instance(**kwargs)
├── pops program_type, is_program_audit from kwargs
```

The note about kwargs seems too implementation-detail-heavy to me. If we ever change from kwargs to args or some other method of passing the data, we will have to update the documentation.
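
To illustrate the behavior the documentation could describe without tying itself to how arguments are passed, here is a rough sketch of a factory that picks a parser from the program type and an audit flag. Everything in it is an assumption made for illustration: the class names, the `"TANF"` and `"FRA"` keys, the `get_parser` function, and the rule that the audit flag takes precedence are not taken from the codebase.

```python
from typing import Dict, Type


class BaseParser:
    """Hypothetical parser base class."""

    def __init__(self, datafile_id: int):
        self.datafile_id = datafile_id


class TanfParser(BaseParser):
    """Hypothetical parser for one program type."""


class FraParser(BaseParser):
    """Hypothetical parser for another program type."""


class ProgramAuditParser(BaseParser):
    """Hypothetical parser used when the file is a program audit."""


# Hypothetical registry: program type -> parser class.
_PARSERS: Dict[str, Type[BaseParser]] = {
    "TANF": TanfParser,
    "FRA": FraParser,
}


def get_parser(program_type: str, is_program_audit: bool = False, **parser_args) -> BaseParser:
    """Select a parser class, then build it with the remaining arguments."""
    if is_program_audit:
        # Assumption: the audit flag overrides the program type.
        return ProgramAuditParser(**parser_args)
    try:
        parser_cls = _PARSERS[program_type]
    except KeyError:
        raise ValueError(f"No parser registered for program type {program_type!r}") from None
    return parser_cls(**parser_args)
```

A behavior-level description such as "the parser is chosen from the program type, unless the file is a program audit" stays accurate whether the selection inputs arrive as kwargs, positional arguments, or a config object.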

Co-authored-by: jtimpe <111305129+jtimpe@users.noreply.github.com>
@raftmsohani
Author

> I only did a high-level pass here, I didn't go verify all the details. That said, a couple points of feedback:
>
> • I don't find the charts to be particularly useful, personally. I'm not quite sure how to read them, and there's a lot of complexity to parse through. The text flows make more sense to me.
> • There's a lot of implementation details represented that I feel are unnecessary. Class names and files might be unavoidable, but references to kwargs, try/except, function/method names, etc. seem too in-the-weeds.
>
> The way it is currently written is about as cognitively demanding as reading the code itself. Plus, as we make changes to these implementation details, we have to meticulously update the documentation alongside or it will go out of date. I'd prefer the documentation to cover the structure, behavior rules, and the "why" behind the implementation, rather than cover the implementation with a lot of detail. Perhaps including some examples of how certain structures or validators get used could be helpful.
>
> Open to conversation and opinions on this, those are just my initial thoughts.

I kind of agree with your observation, Jan. The intention of this documentation, for now, was to document the functionality before we make any changes to parsing/reparsing. But I agree that we should treat the code as the detailed documentation and use this document as high-level documentation that helps the reader understand the flow.

elipe17 commented May 14, 2026

I think we should also use this as an opportunity to clean up some old documentation/diagrams. I'd like to propose removing clean-and-reparse.md, create-elastic-kibana.md, nexus-repo.md, and parsing-flow.md. It would also probably be advantageous to update/delete the resources in docs/Technical-Documentation/diagrams. What are your thoughts? @jtimpe @mattcoleanderson @raftmsohani

@jtimpe jtimpe left a comment

I think this is coming along very nicely! I like the behavior-documentation approach much better than implementation details. I think this could use a bit more detail on:

  • the fixed-width vs columnar files/decoders (it's mentioned briefly in the FRA validation example, but that's all I see)
  • how the task selects which parser to use based on the program/section (might be hard to do without providing a lot of implementation detail, feel free to adjust as you see fit)
  • schema definitions and how validators are defined, as well as the order of operations for the different validation layers

Overall, looking very good!
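
As one possible shape for the fixed-width vs columnar example requested above, the sketch below decodes the same three fields from both kinds of input. The field names, positions, and column order are invented for illustration and do not reflect the real record layouts or decoder classes.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical fixed-width layout: (field name, start index, end index).
FIXED_WIDTH_FIELDS: List[Tuple[str, int, int]] = [
    ("record_type", 0, 2),
    ("case_number", 2, 8),
    ("amount", 8, 12),
]

# Hypothetical column order for a columnar (CSV-style) file.
COLUMNS: List[str] = ["record_type", "case_number", "amount"]


def decode_fixed_width(line: str) -> Dict[str, str]:
    """Slice each field out of the line by character position."""
    return {name: line[start:end].strip() for name, start, end in FIXED_WIDTH_FIELDS}


def decode_columnar(text: str) -> List[Dict[str, str]]:
    """Read delimited rows and pair each cell with its column name."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(COLUMNS, row)) for row in reader if row]
```

In the fixed-width case a field's meaning comes from its position in the line, while in the columnar case it comes from the column the value sits in. That difference is why the two formats need separate decoders even when they feed the same downstream validation.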

4. It applies cross-record rules such as case consistency and duplicate handling.
5. It computes the final `DataFileSummary.status`.
6. It maps the parser outcome back onto `DataFile.state`.
7. It generates an error report and stores it on the summary.

Suggested change
Before: 7. It generates an error report and stores it on the summary.
After: 7. It generates an error report, performs aggregate calculations according to the file type, and stores them on the summary.
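
Steps 5 to 7 of the excerpt, together with the suggested wording, could be sketched at a behavior level like this. The status values, state names, status rule, and aggregate keys are placeholders chosen for illustration. The real `DataFileSummary.status` and `DataFile.state` values and the per-file-type aggregates are not shown in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Summary:
    """Hypothetical stand-in for DataFileSummary."""

    status: str = "pending"
    error_report: List[str] = field(default_factory=list)
    aggregates: Dict[str, int] = field(default_factory=dict)


# Hypothetical mapping from summary status to file state.
STATUS_TO_STATE: Dict[str, str] = {
    "accepted": "PARSE_SUCCEEDED",
    "accepted_with_errors": "PARSE_SUCCEEDED",
    "rejected": "PARSE_FAILED",
}


def finalize(records: List[dict], errors: List[str]) -> Tuple[Summary, str]:
    """Compute the status, map it to a state, then attach report and aggregates."""
    summary = Summary()

    # Step 5: compute the final status (placeholder rule).
    if not records:
        summary.status = "rejected"
    elif errors:
        summary.status = "accepted_with_errors"
    else:
        summary.status = "accepted"

    # Step 6: map the parser outcome back onto the file state.
    state = STATUS_TO_STATE[summary.status]

    # Step 7: store the error report and the aggregates on the summary.
    summary.error_report = sorted(errors)
    summary.aggregates = {"total_records": len(records), "total_errors": len(errors)}
    return summary, state
```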

elipe17 commented May 28, 2026

> I think we should also use this as an opportunity to clean up some old documentation/diagrams. I'd like to propose removing clean-and-reparse.md, create-elastic-kibana.md, nexus-repo.md, and parsing-flow.md. It would also probably be advantageous to update/delete the resources in docs/Technical-Documentation/diagrams. What are your thoughts? @jtimpe @mattcoleanderson @raftmsohani

Did we decide to do anything about this, @raftmsohani? What are your thoughts?

Development

Successfully merging this pull request may close these issues.

Document current parsing & reparsing flows

3 participants