@@ -64,6 +64,8 @@ at first use
6464- Data swapping applied to protect confidentiality
6565- Some geographic detail suppressed for small populations
6666
67+ ---
68+
6769### `raw/tedsd_puf_2023.csv`
6870
6971**Source:** SAMHSA (Substance Abuse and Mental Health Services Administration)
7375
7476**Characteristics:**
7577
76- - **Records:** ~1,400,000 discharge records
78+ - **Records:** ~1,474,000 discharge records
7779- **Variables:** ~50 variables
7880- **Year:** 2023 discharges
7981- **Coverage:** 50 U.S. states, DC, Puerto Rico
@@ -124,14 +126,28 @@ File
124126**Created by:** `2_data_preparation/cleaning_teds_d.ipynb`
125127**Purpose:** Fully cleaned TEDS-D dataset, human-readable
126128
127- **Processing Steps:** Mirrors TEDS-A cleaning
129+ **Processing Steps:**
130+
131+ 1. Missing value codes (-9) converted to NaN
132+ 2. Data types optimized (categorical, Int8, Int64)
133+ 3. New features engineered for treatment optimization
134+ 4. Categorical codes decoded to readable labels
135+ 5. Relevant variables selected and renamed
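
Steps 1, 2, and 4 can be sketched in pandas roughly as follows. This is a minimal illustration and not the notebook's actual code: the column names `AGE` and `SEX`, the toy values, and the label mapping are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few raw columns.
# Column names and values are hypothetical, not taken from the real file.
raw = pd.DataFrame({
    "AGE": [3, -9, 7, 5],
    "SEX": [1, 2, -9, 1],
})

# Step 1: treat the -9 code as missing (NaN).
cleaned = raw.replace(-9, np.nan)

# Step 2: nullable Int8 keeps small integer codes compact
# while still allowing missing values.
cleaned = cleaned.astype("Int8")

# Step 4: decode codes to readable labels (illustrative mapping only);
# codes without a label, including missing ones, stay missing.
sex_labels = {1: "Male", 2: "Female"}
cleaned["sex_label"] = cleaned["SEX"].map(sex_labels).astype("category")
```

The nullable `Int8` dtype is what lets an integer column hold missing values without falling back to floats.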
128136
129137**Characteristics:**
130138
131- - **Records:** ~1,400,000
132- - **Variables:** ~70 variables
139+ - **Records:** ~1,474,000 (all original records retained)
140+ - **Variables:** 74 variables
133141- **Format:** CSV with human-readable text labels
134- - **Missing Data:** Preserved as NaN
142+ - **Size:** ~500 MB
143+ - **Missing Data:** Preserved as NaN for proper handling
144+
145+ **Use Cases:**
146+
147+ - Exploratory Data Analysis (EDA)
148+ - Initial statistical exploration
149+ - Visualization and reporting
150+ - General data inspection
135151
136152---
137153
@@ -171,17 +187,25 @@ File
171187- Minimizes selection bias
172188- Standard practice in epidemiological research
173189
190+ ---
191+
174192### `processed/teds_d_analysis_ready.csv`
175193
176- **Created by:** `2_data_preparation\missing_value_handling_teds_d.ipynb`
194+ **Created by:** `2_data_preparation/missing_value_handling_teds_d.ipynb`
177195**Purpose:** Optimized for statistical analysis with minimal data loss
178196
179- **Processing Steps:** Mirrors TEDS-A cleaning
197+ **Processing Steps:**
198+
199+ - Removed rows missing critical variables only:
200+   - `patient_id`
201+   - `discharge_reason`
202+   - `length_of_stay`
203+ - All other missing values preserved for pairwise deletion
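
A minimal sketch of this row filter, assuming a pandas workflow. The three critical column names come from the list above; the toy values and the `years_using` example column are made up for illustration, and the notebook's real implementation may differ.

```python
import numpy as np
import pandas as pd

# Critical columns as named in the processing steps.
CRITICAL = ["patient_id", "discharge_reason", "length_of_stay"]

# Toy data; values are hypothetical.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "discharge_reason": ["Completed", None, "Dropped out", "Completed"],
    "length_of_stay": [30.0, 12.0, np.nan, 45.0],
    "years_using": [5.0, np.nan, 2.0, np.nan],
})

# Drop a row only when a critical column is missing;
# NaN values in every other column are left in place.
analysis_ready = df.dropna(subset=CRITICAL)

# Share of rows kept after filtering.
retention = len(analysis_ready) / len(df)
```

Leaving the non-critical NaN values in place fits pairwise deletion: pandas' `DataFrame.corr`, for example, excludes missing values pair by pair rather than dropping whole rows.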
180204
181205**Characteristics:**
182206
183- - **Records:** ~1,40,000 (95% retention)
184- - **Variables:** `70 variables
207+ - **Records:** ~1,400,000 (95% retention)
208+ - **Variables:** 74 variables
185209- **Missing Data:** Present in non-critical variables
186210- **Strategy:** Pairwise deletion (each test uses available data)
187211
@@ -242,20 +266,37 @@ File
242266
243267### `processed/teds_d_ml_ready.csv`
244268
245- **Created by:** `2_data_preparation\missing_value_handling_teds_d.ipynb`
269+ **Created by:** `2_data_preparation/missing_value_handling_teds_d.ipynb`
246270**Purpose:** Imputed dataset ready for machine learning models
247271
248- **Processing Steps:** Mirrors TEDS-A Cleaning
272+ **Processing Steps:**
273+
274+ 1. **Numeric variables:** Imputed with median
275+    - `years_using`, `number_of_substances_admit`, `number_of_substances_discharge`
276+
277+ 2. **Categorical variables:** Imputed with mode
278+    - `wait_time_days`, `prior_treatments`, `employment_admit`, `employment_discharge`
279+    - `education_level`, `living_arrangement_admit`, `living_arrangement_discharge`
280+    - `income_source`, `length_of_stay`, `discharge_reason`
249281
250- 1. **Binary variables:** Imputed with 0 (negative/absent)
282+ 3. **Binary variables:** Imputed with 0 (negative/absent)
251283   - All `is_*` and `has_*` flags
284+    - Treatment outcomes: `completed_treatment`, `dropped_out`, `terminated`, `transferred`
285+    - Stay indicators: `short_stay`, `long_stay`
286+    - Improvement metrics: `employment_improved`, `housing_improved`, `arrests_reduced`
287+
288+ 4. **Remaining variables:** Imputed based on data type
289+    - Categorical: mode or 'Unknown'
290+    - Numeric: median or 0
291+    - Includes: demographics (sex, race, ethnicity, marital_status), arrests
292+      data, substance details, clinical variables
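
The median, mode, and zero-fill rules above can be sketched as follows. This is an illustration under assumptions, not the notebook's code: `years_using` and `education_level` appear in the lists above, while `is_veteran` is a hypothetical stand-in for the `is_*` flags and all values are made up.

```python
import numpy as np
import pandas as pd

# Toy data with one column per imputation rule.
df = pd.DataFrame({
    "years_using": [1.0, np.nan, 3.0, 10.0],            # numeric
    "education_level": ["HS", None, "HS", "College"],   # categorical
    "is_veteran": [1.0, np.nan, 0.0, np.nan],           # binary flag (hypothetical name)
})

numeric_cols = ["years_using"]
categorical_cols = ["education_level"]
# Binary flags are picked up by their is_*/has_* prefix.
binary_cols = [c for c in df.columns if c.startswith(("is_", "has_"))]

ml_ready = df.copy()
for col in numeric_cols:
    # Median is robust to skewed values such as the 10.0 here.
    ml_ready[col] = ml_ready[col].fillna(ml_ready[col].median())
for col in categorical_cols:
    # mode() can return several values on ties; take the first.
    ml_ready[col] = ml_ready[col].fillna(ml_ready[col].mode().iloc[0])
for col in binary_cols:
    # Missing flag is treated as 0 (negative/absent).
    ml_ready[col] = ml_ready[col].fillna(0).astype("int8")
```

Zero-filling a binary flag assumes that a missing value means the condition is absent, which is a modeling choice worth revisiting for flags with high missingness.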
252293
253294**Characteristics:**
254295
255- - **Records:** ~1,0,000 (100% retention)
256- - **Variables:** ~70 variables
257- - **Missing Data:** Imputed (no NaN values)
258- - **Strategy:** Statistical imputation
296+ - **Records:** 1,474,025 (100% retention)
297+ - **Variables:** 74 variables
298+ - **Missing Data:** None (fully imputed)
299+ - **Strategy:** Statistical imputation (median/mode/zero-filling)
259300
260301**Use Cases:**
261302
@@ -282,14 +323,14 @@ File
282323| **Cleaned TEDS-A** | 1,625,833 | Yes (NaN) | 0% | EDA, visualization |
283324| **Analysis Ready TEDS-A** | ~1,540,000 | Yes (non-critical) | ~5% | tests |
284325| **ML Ready TEDS-A** | 1,625,833 | No (imputed) | 0% | Machine learning |
285- | **Raw TEDS-D** | 1,400,000 | Yes (-9 codes) | 0% | Source data |
286- | **Cleaned TEDS-D** | 1,400,000 | Yes (NaN) | 0% | EDA, visualization |
326+ | **Raw TEDS-D** | 1,474,025 | Yes (-9 codes) | 0% | Source data |
327+ | **Cleaned TEDS-D** | 1,474,025 | Yes (NaN) | 0% | EDA, visualization |
287328| **Analysis Ready TEDS-D** | ~1,400,000 | Yes (non-critical) | ~5% | tests |
288- | **ML Ready TEDS-D** | 1,400,000 | No (imputed) | 0% | Machine learning |
329+ | **ML Ready TEDS-D** | 1,474,025 | No (imputed) | 0% | Machine learning |
289330
290331---
291332
292- ## Missing Data Patterns
333+ ## Missing Data Patterns - TEDS-A
293334
294335### High Missing Variables (>20%)
295336
@@ -309,6 +350,29 @@ File
309350
310351---
311352
353+ ## Missing Data Patterns - TEDS-D
354+
355+ ### High Missing Variables (>50%)
356+
357+ - `arrests_discharge`: 95.7%
358+ - `arrests_admit`: 94.2%
359+ - `tertiary_substance_discharge`: 84.7%
360+ - `tertiary_substance_admit`: 82.0%
361+ - `pregnant`: 67.4%
362+ - `health_insurance`: 58.1%
363+ - `payment_source`: 56.5%
364+
365+ ### Low Missing Variables (<2%)
366+
367+ - `patient_id`: 0%
368+ - `age_group`: 0%
369+ - `sex`: 0.06%
370+ - `service_type_admit`: 0%
371+ - `discharge_reason`: 0%
372+ - `length_of_stay`: 0%
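
Percentages like these can be computed per column in pandas. The sketch below assumes that workflow and uses made-up values for two of the columns listed above; it does not reproduce the reported figures.

```python
import numpy as np
import pandas as pd

# Toy data; values are hypothetical.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "pregnant": [np.nan, "No", np.nan, np.nan],
})

# isna() marks missing cells; the column mean of that boolean mask
# is the missing fraction, shown here as a percentage.
missing_pct = (df.isna().mean() * 100).round(1).sort_values(ascending=False)
```

Sorting in descending order puts the sparsest variables first, which matches the high/low grouping used above.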
373+
374+ ---
375+
312376## Data Processing Pipeline
313377
314378``` text