Commit 959c5ce

adding the missing value handling for teds-d
1 parent b74f8db commit 959c5ce

2 files changed

Lines changed: 558 additions & 20 deletions

1_datasets/README.md

Lines changed: 84 additions & 20 deletions
@@ -64,6 +64,8 @@ at first use
6464
- Data swapping applied to protect confidentiality
6565
- Some geographic detail suppressed for small populations
6666

67+
---
68+
6769
### `raw/tedsd_puf_2023.csv`
6870

6971
**Source:** SAMHSA (Substance Abuse and Mental Health Services Administration)
@@ -73,7 +75,7 @@ File
7375

7476
**Characteristics:**
7577

76-
- **Records:** ~1,400,000 discharge records
78+
- **Records:** ~1,474,000 discharge records
7779
- **Variables:** ~50 variables
7880
- **Year:** 2023 discharges
7981
- **Coverage:** 50 U.S. states, DC, Puerto Rico
@@ -124,14 +126,28 @@ File
124126
**Created by:** `2_data_preparation/cleaning_teds_d.ipynb`
125127
**Purpose:** Fully cleaned TEDS-D dataset, human-readable
126128

127-
**Processing Steps:** Mirrors TEDS-A cleaning
129+
**Processing Steps:**
130+
131+
1. Missing value codes (-9) converted to NaN
132+
2. Data types optimized (categorical, Int8, Int64)
133+
3. New features engineered for treatment optimization
134+
4. Categorical codes decoded to readable labels
135+
5. Relevant variables selected and renamed
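Steps 1 and 2 above could look roughly like the sketch below. It is not taken from `cleaning_teds_d.ipynb`; the toy frame and the column names `AGE`, `LOS`, and `SERVICES` are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw file; these column names are hypothetical.
raw = pd.DataFrame({
    "AGE": [3, -9, 7, 5],
    "LOS": [12, 30, -9, 1],
    "SERVICES": [1, 2, -9, 2],
})

# Step 1: turn the -9 missing-value code into NaN.
cleaned = raw.replace(-9, np.nan)

# Step 2: nullable integer dtypes keep missing values while staying compact;
# low-cardinality codes can be stored as categoricals.
cleaned["AGE"] = cleaned["AGE"].astype("Int8")
cleaned["LOS"] = cleaned["LOS"].astype("Int64")
cleaned["SERVICES"] = cleaned["SERVICES"].astype("Int8").astype("category")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```

The nullable `Int8`/`Int64` dtypes are used here because plain NumPy integer columns cannot hold NaN.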
128136

129137
**Characteristics:**
130138

131-
- **Records:** ~1,400,000
132-
- **Variables:** ~70 variables
139+
- **Records:** ~1,474,000 (all original records retained)
140+
- **Variables:** 74 variables
133141
- **Format:** CSV with human-readable text labels
134-
- **Missing Data:** Preserved as NaN
142+
- **Size:** ~500MB
143+
- **Missing Data:** Preserved as NaN so later steps can apply deletion or imputation
144+
145+
**Use Cases:**
146+
147+
- Exploratory Data Analysis (EDA)
148+
- Initial statistical exploration
149+
- Visualization and reporting
150+
- General data inspection
135151

136152
---
137153

@@ -171,17 +187,25 @@ File
171187
- Minimizes selection bias
172188
- Standard practice in epidemiological research
173189

190+
---
191+
174192
### `processed/teds_d_analysis_ready.csv`
175193

176-
**Created by:** `2_data_preparation\missing_value_handling_teds_d.ipynb`
194+
**Created by:** `2_data_preparation/missing_value_handling_teds_d.ipynb`
177195
**Purpose:** Optimized for statistical analysis with minimal data loss
178196

179-
**Processing Steps:** Mirrors TEDS-A cleaning
197+
**Processing Steps:**
198+
199+
- Removed rows missing critical variables only:
200+
  - `patient_id`
201+
  - `discharge_reason`
202+
  - `length_of_stay`
203+
- All other missing values preserved for pairwise deletion
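A minimal sketch of the row filter described above, assuming the three critical column names listed in this section. The toy data and the `health_insurance` values are invented for illustration and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Critical variables named in the processing steps.
CRITICAL = ["patient_id", "discharge_reason", "length_of_stay"]

# Invented sample rows; only the column names follow the README.
cleaned = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "discharge_reason": ["Completed", None, "Dropped out", "Transferred"],
    "length_of_stay": [10.0, 5.0, np.nan, 30.0],
    "health_insurance": [None, "Medicaid", None, None],  # non-critical
})

# Drop a row only when a critical variable is missing;
# NaN in other columns is left in place.
analysis_ready = cleaned.dropna(subset=CRITICAL)

retention = len(analysis_ready) / len(cleaned)
print(f"{len(analysis_ready)} rows kept ({retention:.0%} retention)")
```

Restricting `dropna` to `subset=CRITICAL` is what keeps the non-critical gaps available for later pairwise handling.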
180204

181205
**Characteristics:**
182206

183-
- **Records:** ~1,40,000 (95% retention)
184-
- **Variables:** `70 variables
207+
- **Records:** ~1,400,000 (95% retention)
208+
- **Variables:** 74 variables
185209
- **Missing Data:** Present in non-critical variables
186210
- **Strategy:** Pairwise deletion (each test uses available data)
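To illustrate what pairwise deletion means in practice, the sketch below compares the number of usable rows per variable pair with the number left after dropping every incomplete row. The values are invented, and the helper `pairwise_n` is a hypothetical name, not part of the project.

```python
import numpy as np
import pandas as pd

# Invented values for three numeric columns named in this README.
df = pd.DataFrame({
    "years_using": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "number_of_substances_admit": [1.0, np.nan, 2.0, 3.0, 1.0, 2.0],
    "number_of_substances_discharge": [1.0, 1.0, 2.0, np.nan, 1.0, 2.0],
})

def pairwise_n(frame: pd.DataFrame, a: str, b: str) -> int:
    """Count rows where both columns are observed (hypothetical helper)."""
    return int(frame[[a, b]].notna().all(axis=1).sum())

n_years_admit = pairwise_n(df, "years_using", "number_of_substances_admit")
n_admit_discharge = pairwise_n(
    df, "number_of_substances_admit", "number_of_substances_discharge"
)

# Listwise deletion would keep only fully complete rows.
n_listwise = len(df.dropna())

# DataFrame.corr() computes each coefficient from the rows
# available for that specific pair of columns.
corr = df.corr()

print(n_years_admit, n_admit_discharge, n_listwise)
print(corr.round(2))
```

Each pair keeps more rows than the listwise subset, which is the data-loss advantage this strategy is chosen for.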
187211

@@ -242,20 +266,37 @@ File
242266

243267
### `processed/teds_d_ml_ready.csv`
244268

245-
**Created by:** `2_data_preparation\missing_value_handling_teds_d.ipynb`
269+
**Created by:** `2_data_preparation/missing_value_handling_teds_d.ipynb`
246270
**Purpose:** Imputed dataset ready for machine learning models
247271

248-
**Processing Steps:** Mirrors TEDS-A Cleaning
272+
**Processing Steps:**
273+
274+
1. **Numeric variables:** Imputed with median
275+
   - `years_using`, `number_of_substances_admit`, `number_of_substances_discharge`
276+
277+
2. **Categorical variables:** Imputed with mode
278+
   - `wait_time_days`, `prior_treatments`, `employment_admit`, `employment_discharge`
279+
   - `education_level`, `living_arrangement_admit`, `living_arrangement_discharge`
280+
   - `income_source`, `length_of_stay`, `discharge_reason`
249281

250-
1. **Binary variables:** Imputed with 0 (negative/absent)
282+
3. **Binary variables:** Imputed with 0 (negative/absent)
251283
   - All `is_*` and `has_*` flags
284+
   - Treatment outcomes: `completed_treatment`, `dropped_out`, `terminated`, `transferred`
285+
   - Stay indicators: `short_stay`, `long_stay`
286+
   - Improvement metrics: `employment_improved`, `housing_improved`, `arrests_reduced`
287+
288+
4. **Remaining variables:** Imputed based on data type
289+
   - Categorical: mode or 'Unknown'
290+
   - Numeric: median or 0
291+
   - Includes: demographics (sex, race, ethnicity, marital_status), arrests
292+
     data, substance details, clinical variables
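The imputation rules above can be sketched as follows. This is an assumption-laden illustration with one column per rule and invented values, not the code from `missing_value_handling_teds_d.ipynb`.

```python
import numpy as np
import pandas as pd

# One example column per rule; values are invented.
df = pd.DataFrame({
    "years_using": [2.0, np.nan, 10.0, 4.0],                  # numeric
    "education_level": ["High school", None,
                        "High school", "College"],            # categorical
    "completed_treatment": [1.0, np.nan, 0.0, 1.0],           # binary flag
    "marital_status": [None, "Married", None, "Never married"],  # remaining
})

ml_ready = df.copy()

# 1. Numeric: fill with the column median.
ml_ready["years_using"] = ml_ready["years_using"].fillna(
    ml_ready["years_using"].median()
)

# 2. Categorical: fill with the most frequent value (mode).
ml_ready["education_level"] = ml_ready["education_level"].fillna(
    ml_ready["education_level"].mode().iloc[0]
)

# 3. Binary: missing is treated as 0 (negative/absent).
ml_ready["completed_treatment"] = (
    ml_ready["completed_treatment"].fillna(0).astype("int8")
)

# 4. Remaining categorical: an explicit 'Unknown' label.
ml_ready["marital_status"] = ml_ready["marital_status"].fillna("Unknown")

# No NaN should remain after imputation.
print(ml_ready.isna().sum().sum())
```

Zero-filling binary flags assumes that an unrecorded flag means the condition was absent, which is the reading this README gives for `is_*` and `has_*` columns.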
252293

253294
**Characteristics:**
254295

255-
- **Records:** ~1,0,000 (100% retention)
256-
- **Variables:** ~70 variables
257-
- **Missing Data:** Imputed (no NaN values)
258-
- **Strategy:** Statistical imputation
296+
- **Records:** 1,474,025 (100% retention)
297+
- **Variables:** 74 variables
298+
- **Missing Data:** None (fully imputed)
299+
- **Strategy:** Statistical imputation (median/mode/zero-filling)
259300

260301
**Use Cases:**
261302

@@ -282,14 +323,14 @@ File
282323
| **Cleaned TEDS-A** | 1,625,833 | Yes (NaN) | 0% | EDA, visualization |
283324
| **Analysis Ready TEDS-A** | ~1,540,000 | Yes (non-critical) | ~5% | tests |
284325
| **ML Ready TEDS-A** | 1,625,833 | No (imputed) | 0% | Machine learning |
285-
| **Raw TEDS-D** | 1,400,000 | Yes (-9 codes) | 0% | Source data |
286-
| **Cleaned TEDS-D** | 1,400,000 | Yes (NaN) | 0% | EDA, visualization |
326+
| **Raw TEDS-D** | 1,474,025 | Yes (-9 codes) | 0% | Source data |
327+
| **Cleaned TEDS-D** | 1,474,025 | Yes (NaN) | 0% | EDA, visualization |
287328
| **Analysis Ready TEDS-D** | ~1,400,000 | Yes (non-critical) | ~5% | tests |
288-
| **ML Ready TEDS-D** | 1,400,000 | No (imputed) | 0% | Machine learning |
329+
| **ML Ready TEDS-D** | 1,474,025 | No (imputed) | 0% | Machine learning |
289330

290331
---
291332

292-
## Missing Data Patterns
333+
## Missing Data Patterns - TEDS-A
293334

294335
### High Missing Variables (>20%)
295336

@@ -309,6 +350,29 @@ File
309350

310351
---
311352

353+
## Missing Data Patterns - TEDS-D
354+
355+
### High Missing Variables (>50%)
356+
357+
- `arrests_discharge`: 95.7%
358+
- `arrests_admit`: 94.2%
359+
- `tertiary_substance_discharge`: 84.7%
360+
- `tertiary_substance_admit`: 82.0%
361+
- `pregnant`: 67.4%
362+
- `health_insurance`: 58.1%
363+
- `payment_source`: 56.5%
364+
365+
### Low Missing Variables (<2%)
366+
367+
- `patient_id`: 0%
368+
- `age_group`: 0%
369+
- `sex`: 0.06%
370+
- `service_type_admit`: 0%
371+
- `discharge_reason`: 0%
372+
- `length_of_stay`: 0%
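Percentages like those listed above can be computed per column as in this sketch. The toy frame is invented; only the thresholds (>50% and <2%) come from the headings in this section.

```python
import numpy as np
import pandas as pd

# Invented sample; column names follow the README.
df = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "arrests_admit": [np.nan, np.nan, np.nan, 0.0],
    "sex": ["Male", "Female", "Male", "Female"],
})

# Share of missing values per column, as a percentage.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)

high_missing = missing_pct[missing_pct > 50]
low_missing = missing_pct[missing_pct < 2]

print(high_missing)
print(low_missing)
```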
373+
374+
---
375+
312376
## Data Processing Pipeline
313377

314378
```text
