This document explains what lives where and which files are source code, generated outputs, research reports, raw data, or exploratory material.
- README.md: main project overview for scientific users.
- CODEBASE_AUDIT.md: health check, stability review, and recommended next steps.
- RUNBOOK.md: operational instructions for installing, running, and troubleshooting.
- REPOSITORY_MAP.md: this folder and major-file map.
- pyproject.toml: package metadata and runtime dependencies.
- requirements.txt: duplicate runtime dependency list for simple installs.
- requirements-verified.txt: exact tested package set for publication-oriented regeneration.
- activate-project.sh: local activation helper for .venv, PYTHONPATH, and Kaggle config.
- CITATION.cff: citation metadata for the repository.
- LICENSE: repository license.
- .gitignore: local cache, environment, credential, and generated-example ignore rules.
Reusable source package for the scientific workflow.
- data.py: dataset paths, descriptor-column naming, loading, curation, and train/test splitting.
- models.py: Logistic Regression, Random Forest, feedforward neural-network classifier, and reusable tabular regression baselines.
- cross_validation.py: stratified cross-validation, SMOTE sampling, and fold-level diagnostics.
- environment_validation.py: proxy cross-environment validation using descriptor-space clustering.
- descriptor_graph_model.py: exploratory descriptor-graph prototype adapted to descriptor-only data.
- feature_engineering.py: baseline descriptors, chemistry-informed proxy features, and feature-set comparison.
- feature_selection.py: Random Forest importance, permutation importance, mutual information, and reduced feature-set evaluation.
- uncertainty.py: calibration, Brier/log-loss metrics, expected calibration error, uncertainty, and selective prediction.
- reliability_scoreboard.py: final reliability-centered candidate ranking and model-selection figures.
- __init__.py: package marker.
- README.md: package-level navigation notes.
Classification: source code.
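As a reading aid for the calibration terms listed under uncertainty.py, the sketch below shows one common way to compute a Brier score and an expected calibration error (ECE) for binary labels. It is an illustration only and is not taken from the package: the function names, the equal-width binning, and the default of 10 bins are assumptions, and this ECE variant compares the predicted positive-class probability with the observed positive rate in each bin.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted mean gap between average predicted probability and
    observed positive rate, using equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))

    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece
```

Lower values are better for both metrics. The actual definitions used by the project should be checked in uncertainty.py before comparing numbers against this sketch.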
Executable workflow scripts. These are the main entry points for reproducing project artifacts.
- 01_curate_dataset.py: builds curated QSAR dataset and metadata report.
- 02_run_baseline_models.py: runs Logistic Regression and Random Forest baselines.
- 03_run_neural_network_baseline.py: runs feedforward neural-network baseline and comparison report.
- 04_run_descriptor_graph_prototype.py: runs exploratory descriptor-graph prototype.
- 05_run_cross_environment_validation.py: runs proxy environment validation and figures.
- 06_run_stratified_cross_validation.py: runs stratified cross-validation and SMOTE comparison.
- 07_run_feature_engineering_comparison.py: compares baseline descriptors with proxy chemistry features.
- 08_run_feature_importance_selection.py: ranks features and evaluates reduced sets.
- 09_run_uncertainty_reliability_analysis.py: generates calibration, uncertainty, selective prediction, and cross-environment reliability artifacts.
- 10_run_model_reliability_scoreboard.py: generates final model-selection scoreboard, report, and figures.
- generate_all_results.py: orchestrates scripts 01 through 10.
- smoke_test_pipeline.py: quick Random Forest pipeline sanity check.
- template_run_tabular_regression_baselines.py: reusable template for numeric QSAR/QSPR regression datasets.
- dataset_tools/build_homopolymer_bigsmiles_dataset.py: downloads/curates the homopolymer BigSMILES/Tg benchmark and provides optional SMILES-to-BigSMILES helper behavior.
- README.md: script navigation notes.
Classification: executable source code and workflow entry points.
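The two-digit prefixes on the workflow scripts encode their run order. The sketch below shows one way to recover that order from a list of file names. It is a hypothetical helper for illustration and does not describe how generate_all_results.py is actually implemented; the regular expression and the function name are assumptions.

```python
import re

# Assumption: workflow scripts are named NN_<description>.py with a two-digit prefix.
NUMBERED_SCRIPT = re.compile(r"^(\d{2})_.+\.py$")


def ordered_workflow_scripts(names):
    """Return only the numbered scripts, sorted by their numeric prefix."""
    numbered = []
    for name in names:
        match = NUMBERED_SCRIPT.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


if __name__ == "__main__":
    example = [
        "10_run_model_reliability_scoreboard.py",
        "smoke_test_pipeline.py",
        "01_curate_dataset.py",
        "generate_all_results.py",
    ]
    for script in ordered_workflow_scripts(example):
        print(script)
```

Helper scripts without a numeric prefix, such as smoke_test_pipeline.py, are skipped by this filter, which matches their role as utilities rather than pipeline stages.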
Dataset-specific storage. Each dataset folder uses raw/, processed/, and metadata/ where applicable.
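The raw/, processed/, and metadata/ convention can be checked mechanically. The following is a minimal sketch, assuming a dataset folder path is passed in; the function name is hypothetical and no such helper is known to exist in the repository.

```python
from pathlib import Path

# Subfolders named in the convention above; a dataset may omit some "where applicable".
EXPECTED_SUBFOLDERS = ("raw", "processed", "metadata")


def describe_dataset_layout(dataset_dir):
    """Map each expected subfolder name to whether it exists under dataset_dir."""
    root = Path(dataset_dir)
    return {name: (root / name).is_dir() for name in EXPECTED_SUBFOLDERS}
```

Because some datasets legitimately lack a raw/ folder, a False entry is informational rather than an error.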
Primary biodegradation classification dataset.
- raw/qsar_biodegradation.csv: raw descriptor-only QSAR biodegradation data.
- processed/qsar_biodegradation_curated.csv: generated curated dataset with stable column names and labels.
- metadata/qsar_biodegradation_metadata.json: generated curation metadata.
- README files: dataset explanation and navigation.
Classification: raw data, generated processed data, generated metadata.
Structure-aware polymer property dataset for BigSMILES and glass-transition-temperature representation work.
- raw/with_Tg.zip: downloaded source archive.
- raw/with_Tg/Bicerano_bigsmiles.csv: raw extracted experimental subset.
- raw/with_Tg/JCIM_sup_bigsmiles.csv: raw extracted supplementary subset.
- processed/homopolymer_bigsmiles_tg_curated.csv: generated curated BigSMILES/Tg table.
- metadata/source_manifest.json: generated provenance and curation summary.
Classification: raw data, generated processed data, generated metadata. This is not a biodegradation-label dataset.
Large SMILES/activity benchmark downloaded from Kaggle for structure-representation exploration.
- raw/SMILES_Big_Data_Set.csv: raw source file.
- processed/big_molecules_smiles_activity_curated.csv: curated copy.
- metadata/source_manifest.json: source metadata.
Classification: raw data, processed data, metadata. Publication/license status should be verified before formal use.
Reference/schema area for future curated polymer repeat-unit examples.
- processed/bigsmiles_common_repeat_units_schema.csv: schema/reference table.
- metadata/source_manifest.json: metadata.
Classification: reference data and metadata.
Notes on possible structure-aware polymer dataset sources.
Classification: research notes.
Machine-readable generated outputs.
- results/tables/: generated CSV result tables.
- results/metadata/: generated JSON metrics and run metadata.
- results/predictions/: generated prediction-level outputs.
- README files: folder-level navigation.
Important canonical files:
- results/tables/model_reliability_scoreboard.csv
- results/tables/final_uncertainty_metrics.csv
- results/predictions/final_model_predictions.csv
- results/metadata/feature_selection_sets.json
Classification: generated outputs. Preserve for reproducible reporting, but regenerate when finalizing publication results.
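One way to see which preserved outputs change after a regeneration is to record checksums before and after the run. The sketch below is an assumption-laden illustration, not part of the repository: the helper names are hypothetical, and the path list simply repeats the canonical files named in this section.

```python
import hashlib
from pathlib import Path

# Canonical outputs as listed in this map; adjust if the list changes.
CANONICAL_OUTPUTS = (
    "results/tables/model_reliability_scoreboard.csv",
    "results/tables/final_uncertainty_metrics.csv",
    "results/predictions/final_model_predictions.csv",
    "results/metadata/feature_selection_sets.json",
)


def checksum_manifest(repo_root, relative_paths=CANONICAL_OUTPUTS):
    """Return {relative_path: SHA-256 hex digest, or None if the file is missing}."""
    manifest = {}
    for relative in relative_paths:
        path = Path(repo_root) / relative
        if path.is_file():
            manifest[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            manifest[relative] = None
    return manifest


def changed_outputs(before, after):
    """List paths whose digest differs between two manifests."""
    return sorted(key for key in after if before.get(key) != after[key])
```

A typical use would be to call `checksum_manifest(".")` before regenerating, call it again afterwards, and pass both results to `changed_outputs` to get the files worth reviewing.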
Generated and curated visual outputs.
- figures/branding/: banner and identity assets.
- figures/cross_environment/: proxy environment validation plots.
- figures/feature_engineering/: feature-set comparison plots.
- figures/feature_importance/: feature ranking and reduced-set plots.
- figures/uncertainty_calibration/: calibration, uncertainty, and selective prediction plots.
- figures/model_selection/: final reliability scoreboard plots.
- figures/paper/: curated paper-facing copies of key figures.
Classification: generated figures and curated presentation/publication assets.
Human-readable generated summaries.
- dataset_curation.md: dataset curation and representation notes.
- baseline_modeling.md: train/test baseline report.
- neural_network_baseline_summary.txt: FNN baseline interpretation.
- descriptor_graph_prototype_summary.txt: descriptor-graph exploratory result summary.
- cross_environment_validation.md: proxy environment validation report.
- stratified_cross_validation_summary.txt: cross-validation and SMOTE summary.
- feature_engineering_summary.txt: feature-engineering comparison summary.
- feature_importance_selection_summary.txt: feature ranking and reduced-set summary.
- uncertainty_reliability_summary.txt: uncertainty and selective prediction summary.
- model_reliability_report.md: final reliability-centered model-selection report.
- model_reliability_summary.txt: compact final model-selection summary.
- main_findings.md: high-level research findings.
- archive/: preserved earlier weekly/generated report snapshots.
Classification: generated reports and archived research provenance.
Potentially stale/experimental:
- Files in reports/archive/ reflect earlier project stages. They are useful provenance but should not be treated as final paper outputs without verification.
Manuscript workspace.
- manuscript_draft.md: scientific manuscript draft.
- PolyDegradeML_consolidated_manuscript.docx: Word export for review/submission workflows.
- evidence_mapped_outline.md: outline mapped to repository evidence.
- tables_and_figures_plan.md: proposed paper table/figure plan.
- publication_readiness_checklist.md: checklist for publication review.
- missing_information_to_ask_author.md: unresolved information requests.
- build_consolidated_docx.py: document generation utility.
- README.md: paper folder navigation.
Classification: manuscript source, generated document artifact, publication planning.
Slide and communication workspace.
- PolyDegradeML_comprehensive_defense.pptx: comprehensive conference/thesis-style deck.
- PolyDegradeML_comprehensive_defense_speaker_notes.md: speaker notes.
- PolyDegradeML_academic_ted_talk.pptx: academic/TED-style deck.
- PolyDegradeML_academic_ted_talk_speaker_notes.md: speaker notes.
- Plastic_Degradation_Project_Presentation.pptx: earlier/project deck.
- PROJECT_SLIDES.md: markdown slide source/notes.
- SLIDE_ASSET_GUIDE.md: asset usage guide.
- build_*.mjs and build_project_slides.py: slide-building utilities.
- assets/: presentation-ready figure copies.
Classification: presentation source, generated decks, and visual assets.
Preserved original Applied Work and Outside Reading reports.
- source_materials/reports/Week_01_Applied_Work.docx through Week 13 reports.
- Outside Reading report files for literature-review and interpretation provenance.
- README.md: source-material notes.
Classification: research provenance and literature/source material. These are not generated by the workflow scripts.
Developer and architecture documentation.
- docs/uml/project_architecture.mmd: architecture diagram.
- docs/uml/class_diagram.mmd: major classes/dataclasses.
- docs/uml/workflow_sequence.mmd: workflow sequence.
- docs/publication_artifact_provenance.md: maps manuscript figures/tables to scripts and source outputs.
- docs/publication_run_2026-05-28.md: records the stabilized publication-run command, environment, and observed output changes.
- README files: navigation.
Classification: documentation.
Exploratory notebook area.
- notebooks/exploratory/README.md: explains that exploratory notebooks are non-canonical.
Classification: exploratory workspace. Canonical reproducible work should remain in scripts/ and src/.
Test suite.
- tests/test_data.py: unit/regression tests for data loading, curation, model workflows, feature engineering, uncertainty, and reliability scoreboard.
- tests/README.md: test navigation.
Classification: test source code.
- synthetic_regression_example.csv: local generated example output, ignored by Git.
- regression_scores_example.json: local generated example output, ignored by Git.
- .kaggle/access_token: local credential/config file, ignored by Git. Do not commit.
- __pycache__/: Python bytecode cache directories. Removed during the health pass and ignored by Git.
Do not delete reports, paper files, presentations, datasets, or figures without explicit approval.