feat: add detailed sanity checking #212
Pull Request Overview
This PR introduces a comprehensive sanity checking framework for dataset validation, replacing the previous implementation with a modular, registry-based system that organizes checks by category (structure, schema, reference, format) and provides improved reporting capabilities.
Key changes include:
- New extensible checker framework with base classes, registry, and context objects for managing dataset metadata
- Suite of 44+ modular validation checkers organized by rule category (STR, SCH, REF, FMT)
- Enhanced CLI with detailed reporting, summary tables, and result serialization
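As a rough illustration of the registry-based design described above, the following is a minimal sketch and not the PR's actual code. All names here (`RuleGroup`, `Context`, `Report`, `register`, `run_all`) and the rule IDs `STR900` and `FMT900` are hypothetical; only the category prefixes (STR, SCH, REF, FMT) and the idea of skip logic come from the PR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class RuleGroup(Enum):
    """Rule categories, keyed by the ID prefixes named in the PR."""
    STRUCTURE = "STR"
    SCHEMA = "SCH"
    REFERENCE = "REF"
    FORMAT = "FMT"


@dataclass
class Context:
    """Hypothetical stand-in for the dataset metadata passed to each checker."""
    data_root: str
    has_annotation_dir: bool = True


@dataclass
class Report:
    rule_id: str
    status: str  # "passed", "failed", or "skipped"
    reasons: List[str]


class Checker:
    """Base class: subclasses set rule_id and implement check()."""
    rule_id: str = ""

    def can_skip(self, context: Context) -> Optional[str]:
        # Return a reason string to skip this rule, or None to run it.
        return None

    def check(self, context: Context) -> List[str]:
        # Return a list of failure reasons; an empty list means success.
        raise NotImplementedError

    def __call__(self, context: Context) -> Report:
        skip_reason = self.can_skip(context)
        if skip_reason is not None:
            return Report(self.rule_id, "skipped", [skip_reason])
        reasons = self.check(context)
        return Report(self.rule_id, "failed" if reasons else "passed", reasons)


# One sub-registry per category, filled by the decorator below.
REGISTRY: Dict[RuleGroup, Dict[str, type]] = {group: {} for group in RuleGroup}


def register(cls: type) -> type:
    """Class decorator that files a checker under its category prefix."""
    group = RuleGroup(cls.rule_id[:3])
    if cls.rule_id in REGISTRY[group]:
        raise ValueError(f"duplicate rule id: {cls.rule_id}")
    REGISTRY[group][cls.rule_id] = cls
    return cls


@register
class AnnotationDirExists(Checker):
    rule_id = "STR900"  # made-up ID, not one of the PR's rules

    def check(self, context: Context) -> List[str]:
        return [] if context.has_annotation_dir else ["annotation directory is missing"]


@register
class AnnotationFormat(Checker):
    rule_id = "FMT900"  # made-up ID, not one of the PR's rules

    def can_skip(self, context: Context) -> Optional[str]:
        return None if context.has_annotation_dir else "no annotation directory to inspect"

    def check(self, context: Context) -> List[str]:
        return []


def run_all(context: Context) -> List[Report]:
    """Run every registered checker, category by category."""
    return [cls()(context) for group in RuleGroup for cls in REGISTRY[group].values()]
```

Calling `run_all(Context(data_root="/path/to/dataset", has_annotation_dir=False))` in this sketch yields a failed report for the structure rule and a skipped report for the format rule, which is the kind of per-rule result a summary table could be built from.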
Reviewed Changes
Copilot reviewed 60 out of 60 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| t4_devkit/schema/builder.py | Added build_schema_safe function for error-safe schema building |
| t4_devkit/sanity/*.py | Core framework files (checker, context, result, registry, run, safety) |
| t4_devkit/sanity/structure/*.py | Structure validation checkers (STR001-STR009) |
| t4_devkit/sanity/schema/*.py | Schema validation checkers (SCH001-SCH006) |
| t4_devkit/sanity/reference/*.py | Reference validation checkers (REF001-REF011) |
| t4_devkit/sanity/format/*.py | Format validation checkers (FMT001-FMT018) |
| t4_devkit/cli/sanity.py | Updated CLI to use new framework with improved output |
| pyproject.toml | Added returns dependency |
| docs/schema/requirement.md | New requirements documentation |
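The "error-safe" building mentioned for `build_schema_safe` can be pictured as returning a result object instead of raising. The sketch below is an assumption about the general pattern, not the PR's implementation: `Outcome` and `parse_records_safe` are hypothetical names, and it uses only the standard library, whereas the PR adds the `returns` package (which provides a `Result` container and a `safe` decorator) for this purpose.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Outcome:
    """Hypothetical result container: holds either a value or an error."""
    value: Optional[Any] = None
    error: Optional[Exception] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def parse_records_safe(text: str) -> Outcome:
    """Parse a JSON array of records, capturing failures instead of raising."""
    try:
        records = json.loads(text)
        if not isinstance(records, list):
            raise TypeError("expected a JSON array of records")
        return Outcome(value=records)
    except (json.JSONDecodeError, TypeError) as exc:
        # The caller decides how to report the failure.
        return Outcome(error=exc)
```

With this shape, a checker can turn a failed parse into a reported rule violation rather than letting one malformed file abort the whole run.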
001bb62 to
531a48e
Compare
shekharhimanshu
left a comment
Thank you for the PR! I have asked one question.
@shekharhimanshu @SamratThapa120 Let me ask about the usage of
<DB_PARENT>
├── dataset1
│   └── <VERSION>
│       ├── annotation
│       ├── data
│       ...
├── dataset2
│   ├── annotation
│   ├── data
│   ...
...
It outputs a single JSON file with
Do you prefer validating only a single dataset and generating a JSON file containing the result for that single dataset?
@ktro2828
@ktro2828 I agree with @shekharhimanshu; validating a single dataset seems sufficient.
@ktro2828 Sorry for the delay, I will finish the review by tomorrow.
SamratThapa120
left a comment
Thanks for the great feature. I have left a question.
@shekharhimanshu @SamratThapa120 Thank you guys for the comments. I updated
@SamratThapa120 No worries, take your time! I'm sorry for this huge PR, too.
shekharhimanshu
left a comment
Thank you for this PR. LGTM! 💯
Signed-off-by: ktro2828 <kotaro.uetake@tier4.jp>
f6a71f5 to
ae62130
Compare
LGTM 💯.
Please resolve this in subsequent PRs:
#212 (comment)
What
This pull request introduces a new, extensible framework for dataset sanity checking, including a registry-based checker system, a context object for passing dataset metadata, and a set of modular, schema-driven field validation checkers. It also updates the CLI to use the new system and improves output formatting. The changes are organized into the following themes:
1. Sanity Checker Framework and Registry
- `Checker` class in `t4_devkit/sanity/checker.py` for implementing individual rule checkers, with support for skip logic and standardized result reporting.
2. Context and Result Handling
- `SanityContext` class (`t4_devkit/sanity/context.py`) to encapsulate dataset metadata and provide convenient access to dataset paths and schema files.
3. Modular Field Validation Checkers
- Field validation checkers in `t4_devkit/sanity/format/`, registered in the new system.
4. CLI Refactor and Output Improvements
- CLI (`t4_devkit/cli/sanity.py`) updated to use the new checker/result system, including improved summary and detailed reporting with tabular output, and support for serializing results.
- New dependency, `returns`, to support functional error handling and optional types.
5. Documentation
- Requirements documentation (`docs/schema/requirement.md`) listing all dataset structure, schema, reference, and format rules, serving as the basis for the implemented checkers.
How to Use?
- CLI (`t4sanity` command): for the CLI, the input dataset root must be the directory path of a dataset.
Sample of console output:
Sample output of JSON file:
result.json
- `sanity_check(...)` function in your codebase: