refactor: standardize CSV loading from ./datasets and improve robustness #224
Merged
msoedov merged 1 commit into msoedov:main on May 19, 2025
- Load all CSVs from ./datasets directory
- Add encoding_errors='ignore' for resilient CSV parsing
- Ensure prompt generators are converted to lists before sampling
msoedov (Owner) approved these changes on May 19, 2025:
Hi @Mundi-Xu, thank you for the PR. You are a legend!
This PR standardizes and improves the robustness of loading local CSV files used for prompt datasets. The main changes include:
- Unified CSV Loading Path: All CSV files are now consistently loaded from the `./datasets` directory instead of the current working directory. This improves project structure and avoids mixing data with code.
- Improved Fault Tolerance: Added `encoding_errors="ignore"` to all `pd.read_csv` calls to gracefully handle files with encoding issues or unexpected characters.
- Prompt List Handling Fix: In `StenographyTransformer`, added a check to ensure `dataset.prompts` is iterable and converted it to a list before sampling, preventing runtime errors when handling generators or non-list types.
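For reviewers who want a concrete picture of the first two changes, here is a minimal sketch of loading a CSV from a fixed `./datasets` directory with `encoding_errors="ignore"`. The helper name `load_local_csv`, the `base_dir` parameter, and the file name in the usage note are hypothetical and are not taken from the PR diff; only the directory name and the `pd.read_csv` argument come from the description above. `encoding_errors` requires pandas 1.3 or newer.

```python
from pathlib import Path

import pandas as pd

# Assumption: the directory name comes from the PR description; the helper
# name and its parameters are hypothetical illustrations.
DATASETS_DIR = Path("./datasets")


def load_local_csv(filename: str, base_dir: Path = DATASETS_DIR) -> pd.DataFrame:
    """Read a CSV from the datasets directory, skipping undecodable bytes."""
    path = base_dir / filename
    # encoding_errors="ignore" drops bytes that cannot be decoded in the
    # file's encoding (UTF-8 by default) instead of raising an error.
    return pd.read_csv(path, encoding_errors="ignore")
```

A call such as `load_local_csv("prompts.csv")` would then resolve to `./datasets/prompts.csv` regardless of which module invokes it. Note that ignoring decode errors silently discards the offending bytes, so the loaded text may differ slightly from the file on disk.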
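The prompt list handling fix can be sketched in a similar way. `random.sample` requires a sequence, so passing it a generator raises `TypeError`; materializing the prompts into a list first avoids that. The function below is a hypothetical stand-in, not the actual code in `StenographyTransformer`, and the names `sample_prompts`, `k`, and `rng` are assumptions made for illustration.

```python
import random
from collections.abc import Iterable
from typing import List, Optional


def sample_prompts(prompts, k: int, rng: Optional[random.Random] = None) -> List[str]:
    """Sample up to k prompts from any iterable, including generators."""
    # Hypothetical check mirroring the PR description: reject non-iterables early.
    if not isinstance(prompts, Iterable):
        raise TypeError("prompts must be iterable")
    # Materialize generators and other non-list iterables before sampling,
    # because random.sample only accepts sequences.
    items = list(prompts)
    rng = rng or random.Random()
    # Cap k at the population size so a small dataset does not raise ValueError.
    return rng.sample(items, min(k, len(items)))
```

Converting to a list consumes the generator and holds every prompt in memory, which is acceptable for small local datasets but may need rethinking for very large ones.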