Add onboarding_report.csv generation#1097
Conversation
|
Switching this to a draft to wait for the next machine release to add |
benjaminking
left a comment
There was a problem hiding this comment.
Sorry it took so long to get to this.
@benjaminking reviewed 3 files and all commit messages, and made 5 comments.
Reviewable status: all files reviewed, 4 unresolved discussions.
silnlp/common/extract_corpora.py line 79 at r2 (raw file):
versification_error_output_path: Optional[str] = None, environment: SilNlpEnv = SilNlpEnv.create_standard_environment(), ) -> Tuple[Path | None, List[int], int, int]:
When you have a return type that gets this complex, it's usually good to think about creating a class to represent the return value. The class gives you some semantics about what the different parts mean, which helps to reduce programming errors. For example, it would be really easy to accidentally mix up the terms count and the versification error count, but less easy with a class.
silnlp/common/extract_corpora.py line 96 at r2 (raw file):
LOGGER.info(f"Identified parent project {parent_project_dir.name}.") matching, detected_versification, versification_error_count = check_versification(
It would similarly be good to have a class the represents the output of check_versification.
silnlp/common/onboard_project.py line 431 at r2 (raw file):
def generate_report(self) -> dict: self.report.name_on_bucket = self.project_name
Are you familiar with the "Builder" pattern? It's a software design pattern used for creating a complex object (like OnboardingReport), and I think it would be a good fit for what you're doing here. It would let you encapsulate the logic for building the report inside of the OnboardingReport class. Accessing class members outside of the class is always a risk for bugs as the software evolves.
silnlp/common/onboard_project.py line 493 at r2 (raw file):
self.main_project: OnboardingProject = OnboardingProject( project_name=main_project_name, project_type="Request Project",
The project type might be a good use case for an enum, since there's a small set of possible values it could take.
|
@benjaminking I forgot to mention this, but PR is a newer version of this PR. |
This PR adds a
onboarding_report.csvfile that is generated at the end of an Onboarding Request with basic information on each of the projects all compiled in one file, rather than being spread across the command line output and various other output files.OnboardingReportclass that is used by eachOnboardingProjectto store and format the information needed for the csv filegenerate_reportandcreate_reportfunctions toOnboardingProjectandOnboardingRequestrespectively to create the csv file using the values from theOnboardingReportfor each project in the Requestextract_corporaandcheck_versificationto pass alongdetected_versification,versification_error_count, andterms_countto theOnboardingProject` for the Onboarding Report.This PR also addresses two small bugs:
corpus_stats.csvtocorpus-stats.csvThis change is