@@ -12,14 +12,19 @@ processed subfolders following best practices for reproducible research.
1212``` text
13131_datasets/
1414├── raw/
15- │ └── tedsa_puf_2023.csv
15+ │ ├── tedsa_puf_2023.csv
16+ │ └── tedsd_puf_2023.csv
1617├── processed/
1718│ ├── teds_a_2023_cleaned.csv
1819│ ├── teds_analysis_ready.csv
19- │ └── teds_ml_ready.csv
20- ├──sample/
20+ │ ├── teds_ml_ready.csv
21+ │ ├── teds_d_2023_cleaned.csv
22+ │ ├── teds_d_analysis_ready.csv
23+ │ └── teds_d_ml_ready.csv
24+ ├── sample/
2125│ └── tedsa_sample.csv
2226└── README.md
27+
2328```
2429
2530---
4146- ** Coverage:** 50 U.S. states, DC, Puerto Rico (excluding Delaware, South
4247Carolina, West Virginia)
4348- ** Format:** CSV with numeric codes for categorical variables
44- - ** Size:** ~ 800 MB
49+ - ** Size:** ~ 300 MB
4550
4651** Key Variables:**
4752
@@ -59,13 +64,34 @@ at first use
5964- Data swapping applied to protect confidentiality
6065- Some geographic detail suppressed for small populations
6166
67+ ### ` raw/tedsd_puf_2023.csv `
68+
69+ ** Source:** SAMHSA (Substance Abuse and Mental Health Services Administration)
70+ ** Dataset:** Treatment Episode Data Set - Discharges (TEDS-D), 2023 Public Use
71+ File
72+ ** Download:** [ SAMHSA TEDS Data] ( https://www.samhsa.gov/data/data-we-collect/teds-treatment-episode-data-set/datafiles )
73+
74+ ** Characteristics:**
75+
76+ - ** Records:** ~ 1,400,000 discharge records
77+ - ** Variables:** ~ 50 variables
78+ - ** Year:** 2023 discharges
79+ - ** Coverage:** 50 U.S. states, DC, Puerto Rico
80+ - ** Format:** CSV with numeric codes for categorical variables
81+ - ** Size:** ~ 300 MB
82+
83+ ** Key Variables:**
84+
85+ - Demographics, substance use, treatment, clinical, socioeconomic, geographic
86+ - Discharge-specific: _ D suffix variables track patient status at discharge
87+
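A minimal sketch of reading a numerically coded raw file, assuming that `-9` is the missing-value code (as the Data Quality Summary indicates) and that every other value is an integer code. The column names and the `read_raw_rows` helper are hypothetical, and the sample is an in-memory stand-in for the real CSV.

```python
import csv
import io

MISSING_CODE = "-9"  # assumption: -9 marks a missing value in the raw files


def read_raw_rows(handle):
    """Yield one dict per record, replacing the -9 code with None."""
    for row in csv.DictReader(handle):
        yield {
            name: None if value.strip() == MISSING_CODE else int(value)
            for name, value in row.items()
        }


# Tiny in-memory stand-in for the raw file; column names are illustrative only.
sample = io.StringIO("AGE,SERVICES_D,EMPLOY_D\n5,7,-9\n-9,2,1\n")
rows = list(read_raw_rows(sample))
print(rows)
```

To read the real file, the same function could be given an open file handle for `1_datasets/raw/tedsd_puf_2023.csv` instead of the in-memory sample.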
6288---
6389
6490## Processed Data
6591
6692### ` processed/teds_a_2023_cleaned.csv `
6793
68- ** Created by:** ` 2_data_preparation\cleaning_preprocessing .ipynb `
94+ ** Created by:** ` 2_data_preparation\cleaning_teds_a .ipynb `
6995** Purpose:** Fully cleaned and human-readable dataset for all analyses
7096
7197** Processing Steps:**
@@ -81,7 +107,7 @@ at first use
81107- ** Records:** ~ 1,625,833 (all original records retained)
82108- ** Variables:** 50 variables
83109- ** Format:** CSV with human-readable text labels
84- - ** Size:** ~ 240 MB (70% reduction from raw)
110+ - ** Size:** ~ 500 MB
85111- ** Missing Data:** Preserved as NaN for proper handling
86112
87113** Variables:** See main project README or data dictionary for complete list
@@ -93,11 +119,25 @@ at first use
93119- Visualization and reporting
94120- General data inspection
95121
122+ ### ` processed/teds_d_2023_cleaned.csv `
123+
124+ ** Created by:** ` 2_data_preparation/cleaning_teds_d.ipynb `
125+ ** Purpose:** Fully cleaned TEDS-D dataset, human-readable
126+
127+ ** Processing Steps:** Mirrors TEDS-A cleaning
128+
129+ ** Characteristics:**
130+
131+ - ** Records:** ~ 1,400,000
132+ - ** Variables:** ~ 70 variables
133+ - ** Format:** CSV with human-readable text labels
134+ - ** Missing Data:** Preserved as NaN
135+
96136---
97137
98138### ` processed/teds_analysis_ready.csv `
99139
100- ** Created by:** ` 2_data_preparation\missing_value_handling .ipynb `
140+ ** Created by:** ` 2_data_preparation\missing_value_handling_teds_a .ipynb `
101141** Purpose:** Optimized for statistical analysis with minimal data loss
102142
103143** Processing Steps:**
@@ -131,11 +171,39 @@ at first use
131171- Minimizes selection bias
132172- Standard practice in epidemiological research
133173
174+ ### ` processed/teds_d_analysis_ready.csv `
175+
176+ ** Created by:** ` 2_data_preparation\missing_value_handling_teds_d.ipynb `
177+ ** Purpose:** Optimized for statistical analysis with minimal data loss
178+
179+ ** Processing Steps:** Mirrors the TEDS-A analysis-ready processing (missing value handling)
180+
181+ ** Characteristics:**
182+
183+ - ** Records:** ~ 1,400,000 (95% retention)
184+ - ** Variables:** ~ 70 variables
185+ - ** Missing Data:** Present in non-critical variables
186+ - ** Strategy:** Pairwise deletion (each test uses available data)
187+
188+ ** Use Cases:**
189+
190+ - Statistical hypothesis testing
191+ - Correlation analysis
192+ - Chi-square tests
193+ - Group comparisons (t-tests, ANOVA, Mann-Whitney)
194+ - Regression analysis
195+
196+ ** Advantages:**
197+
198+ - Maximizes statistical power
199+ - Minimizes selection bias
200+ - Standard practice in epidemiological research
201+
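Pairwise deletion, as named in the strategy above, can be illustrated with a small sketch: each test drops only the rows that are missing for the variables it actually uses. The column names and values below are made up for illustration, and `pairwise_complete` and `pearson` are hypothetical helpers, not functions from the project notebooks.

```python
import math


def pairwise_complete(xs, ys):
    """Keep only the positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def pearson(xs, ys):
    """Plain Pearson correlation for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Illustrative columns with gaps in different rows.
age = [1, 2, None, 4, 5]
stay = [2, 4, 6, None, 10]
visits = [5, None, 3, 2, 1]

# Each test uses its own set of complete rows.
x, y = pairwise_complete(age, stay)        # rows 0, 1, 4
r_age_stay = pearson(x, y)

x2, y2 = pairwise_complete(age, visits)    # rows 0, 3, 4
r_age_visits = pearson(x2, y2)

print(len(x), len(x2))
```

Dropping every row with any missing value (listwise deletion) would leave only rows 0 and 4 here, which is why the pairwise approach retains more data per test.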
134202---
135203
136204### ` processed/teds_ml_ready.csv `
137205
138- ** Created by:** ` 2_data_preparation\missing_value_handling .ipynb `
206+ ** Created by:** ` 2_data_preparation\missing_value_handling_teds_a .ipynb `
139207** Purpose:** Imputed dataset ready for machine learning models
140208
141209** Processing Steps:**
@@ -172,16 +240,52 @@ at first use
172240- Document imputation methods in model cards
173241- Consider multiple imputation for sensitivity analysis
174242
243+ ### ` processed/teds_d_ml_ready.csv `
244+
245+ ** Created by:** ` 2_data_preparation\missing_value_handling_teds_d.ipynb `
246+ ** Purpose:** Imputed dataset ready for machine learning models
247+
248+ ** Processing Steps:** Mirrors the TEDS-A ML-ready imputation, including:
249+
250+ 1 . ** Binary variables:** Imputed with 0 (negative/absent)
251+ - All ` is_* ` and ` has_* ` flags
252+
253+ ** Characteristics:**
254+
255+ - ** Records:** ~ 1,400,000 (100% retention)
256+ - ** Variables:** ~ 70 variables
257+ - ** Missing Data:** Imputed (no NaN values)
258+ - ** Strategy:** Statistical imputation
259+
260+ ** Use Cases:**
261+
262+ - Machine learning model training
263+ - Predictive modeling
264+ - Classification and regression algorithms
265+ - Neural networks
266+ - Ensemble methods
267+
268+ ** Important Notes:**
269+
270+ - Imputation may introduce bias
271+ - Use with caution for inferential statistics
272+ - Document imputation methods in model cards
273+ - Consider multiple imputation for sensitivity analysis
274+
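The binary-flag step listed above (missing `is_*` and `has_*` values imputed with 0) might look like the following sketch. It covers only that one step; the record keys and the `impute_binary_flags` helper are hypothetical, and other variable types are left untouched because their imputation rules are not listed here.

```python
def impute_binary_flags(record):
    """Return a copy where missing is_*/has_* flags are set to 0."""
    return {
        key: 0 if value is None and key.startswith(("is_", "has_")) else value
        for key, value in record.items()
    }


# Illustrative record; None stands for a missing value (NaN in the CSV).
record = {"is_veteran": None, "has_prior_treatment": 1, "age_group": None}
filled = impute_binary_flags(record)
print(filled)
```

Note that treating a missing flag as 0 assumes "unknown" means "absent", which is one source of the imputation bias mentioned in the notes above.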
175275---
176276
177277## Data Quality Summary
178278
179279| Dataset | Records | Missing Data | Data Loss | Primary Use |
180280| ---------| ---------| --------------| -----------| -------------|
181- | ** Raw** | 1,625,833 | Yes (-9 codes) | 0% | Source data |
182- | ** Cleaned** | 1,625,833 | Yes (NaN) | 0% | EDA, visualization |
183- | ** Analysis Ready** | ~ 1,540,000 | Yes (non-critical) | ~ 5% | Statistical test|
184- | ** ML Ready** | 1,625,833 | No (imputed) | 0% | Machine learning |
281+ | ** Raw TEDS-A** | 1,625,833 | Yes (-9 codes) | 0% | Source data |
282+ | ** Cleaned TEDS-A** | 1,625,833 | Yes (NaN) | 0% | EDA, visualization |
283+ | ** Analysis Ready TEDS-A** | ~ 1,540,000 | Yes (non-critical) | ~ 5% | Statistical tests |
284+ | ** ML Ready TEDS-A** | 1,625,833 | No (imputed) | 0% | Machine learning |
285+ | ** Raw TEDS-D** | ~ 1,400,000 | Yes (-9 codes) | 0% | Source data |
286+ | ** Cleaned TEDS-D** | ~ 1,400,000 | Yes (NaN) | 0% | EDA, visualization |
287+ | ** Analysis Ready TEDS-D** | ~ 1,400,000 | Yes (non-critical) | ~ 5% | Statistical tests |
288+ | ** ML Ready TEDS-D** | ~ 1,400,000 | No (imputed) | 0% | Machine learning |
185289
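The Data Loss column can be recomputed from the record counts. A small sketch, using the TEDS-A figures from the table (the `data_loss_pct` helper is hypothetical):

```python
def data_loss_pct(raw_records, retained_records):
    """Share of raw records that are not retained, as a percentage."""
    return 100 * (raw_records - retained_records) / raw_records


# TEDS-A: raw count vs. approximate analysis-ready count from the table.
teds_a_loss = data_loss_pct(1_625_833, 1_540_000)
print(f"{teds_a_loss:.1f}%")
```

Because the analysis-ready count is approximate, the result only confirms the rounded "~ 5%" figure.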
186290---
187291
@@ -208,29 +312,32 @@ at first use
208312## Data Processing Pipeline
209313
210314``` text
211- Raw Data (tedsa_puf_2023 .csv)
315+ Raw Data (tedsx_puf_2023 .csv)
212316 ↓
213317[Data Cleaning Pipeline]
214318 ↓
215- Cleaned Data (teds_a_2023_cleaned .csv)
319+ Cleaned Data (teds_x_2023_cleaned .csv)
216320 ↓
217321[Missing Value Strategy]
218322 ↓
219323 ├── Analysis Ready (95% data, pairwise deletion)
220324 └── ML Ready (100% data, imputation)
221325```
222326
223- > ** Note:** The TEDS-A dataset is not included in this repository due to size
327+ > ** Note:** The TEDS-x dataset is not included in this repository due to size
224328 and data governance considerations.
225329> To reproduce analyses, download the dataset directly from SAMHSA TEDS Data
226330> and place it in the ` 1_datasets/raw/ ` folder.
331+ > Here, "x" stands for either A (admissions) or D (discharges).
227332
228333### Reproducing Results
229334
230335To reproduce cleaning and preprocessing:
231336
232- 1 . Download ` tedsa_puf_2023.csv `
337+ 1 . Download ` tedsa_puf_2023.csv ` and ` tedsd_puf_2023.csv `
2333382 . Place both files in ` 1_datasets/raw/ `
234- 3 . Run ` 2_data_preparation/cleaning_preprocessing.ipynb ` and
235- ` 2_data_preparation\missing_value_handling.ipynb `
339+ 3 . Run ` 2_data_preparation/cleaning_teds_a.ipynb ` ,
340+ ` 2_data_preparation/cleaning_teds_d.ipynb `
341+ and ` 2_data_preparation/missing_value_handling_teds_a.ipynb ` ,
342+ ` 2_data_preparation/missing_value_handling_teds_d.ipynb ` .
2363434 . Output files will appear in ` 1_datasets/processed/ `
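Before running the notebooks, it can help to confirm that both raw files are in place. A minimal sketch using only the file and folder names given in the steps above; the `missing_raw_files` helper is hypothetical and assumes the script is run from the project root.

```python
from pathlib import Path

RAW_FILES = ["tedsa_puf_2023.csv", "tedsd_puf_2023.csv"]


def missing_raw_files(project_root):
    """Return the raw file names that are not yet in 1_datasets/raw/."""
    raw_dir = Path(project_root) / "1_datasets" / "raw"
    return [name for name in RAW_FILES if not (raw_dir / name).is_file()]


# Assumes the current directory is the project root.
missing = missing_raw_files(".")
if missing:
    print("Missing raw files:", ", ".join(missing))
else:
    print("All raw files found; the notebooks can be run.")
```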