refactor: standardize CSV loading from ./datasets and improve robustness #224

Merged
msoedov merged 1 commit into msoedov:main from Mundi-Xu:datasets-optimize
May 19, 2025

@Mundi-Xu
Contributor

This PR standardizes and improves the robustness of loading local CSV files used for prompt datasets. The main changes include:

  1. Unified CSV Loading Path
    All CSV files are now consistently loaded from the ./datasets directory instead of the current working directory. This improves project structure and avoids mixing data with code.

  2. Improved Fault Tolerance
    Added encoding_errors="ignore" to all pd.read_csv calls to gracefully handle files with encoding issues or unexpected characters.

  3. Prompt List Handling Fix
    In StenographyTransformer, added a check that dataset.prompts is iterable and now convert it to a list before sampling, which prevents runtime errors when the prompts arrive as a generator or another non-list type.
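
A minimal sketch of the idea in item 3, not the code from this PR: the helper name `sample_prompts` and its parameters are hypothetical. It relies on the fact that `random.sample` requires a sequence and raises `TypeError` when given a generator, so the prompts are materialized into a list first.

```python
import random
from collections.abc import Iterable


def sample_prompts(prompts, k, seed=None):
    """Return up to k prompts sampled without replacement.

    Hypothetical helper: accepts any iterable (list, tuple, generator).
    """
    if not isinstance(prompts, Iterable):
        raise TypeError(f"prompts must be iterable, got {type(prompts).__name__}")
    # Materialize generators and other lazy iterables so they can be sampled.
    prompt_list = list(prompts)
    rng = random.Random(seed)
    # Cap k so that a short dataset does not raise ValueError.
    return rng.sample(prompt_list, min(k, len(prompt_list)))
```

Capping `k` at the list length is a design choice of this sketch; the PR description does not say how short datasets are handled.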

- Load all CSVs from ./datasets directory
- Add encoding_errors='ignore' for resilient CSV parsing
- Ensure prompt generators are converted to lists before sampling
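
The first two points can be illustrated with a short sketch. This is an assumption-laden example rather than the code merged here: `load_local_csv`, `DATASETS_DIR`, and the `datasets_dir` parameter are hypothetical names. The `encoding_errors` argument of `pd.read_csv` is available in pandas 1.3 and later.

```python
from pathlib import Path

import pandas as pd

# Assumption: datasets live in ./datasets relative to the working directory.
DATASETS_DIR = Path("datasets")


def load_local_csv(filename, datasets_dir=DATASETS_DIR):
    """Load a CSV from the datasets directory, tolerating bad bytes."""
    path = Path(datasets_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    # encoding_errors="ignore" drops bytes that cannot be decoded
    # instead of raising UnicodeDecodeError.
    return pd.read_csv(path, encoding_errors="ignore")
```

Note that `"ignore"` discards undecodable bytes silently, so a damaged file loads without any warning; `"replace"` is an alternative if visible markers are preferred.
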
@msoedov
Owner

msoedov commented May 19, 2025

Hi @Mundi-Xu, thank you for the PR. You are a legend!

@msoedov msoedov merged commit 2bc0605 into msoedov:main May 19, 2025
2 of 5 checks passed
@Mundi-Xu Mundi-Xu deleted the datasets-optimize branch May 19, 2025 10:45