Commit 3ebec7e

adding teds_d to the datasets README
1 parent 05f3c78 commit 3ebec7e

1 file changed

Lines changed: 125 additions & 18 deletions

File tree

1_datasets/README.md

@@ -12,14 +12,19 @@ processed subfolders following best practices for reproducible research.
1212
```text
1313
1_datasets/
1414
├── raw/
15-
│ └── tedsa_puf_2023.csv
15+
│ ├── tedsa_puf_2023.csv
16+
│ └── tedsd_puf_2023.csv
1617
├── processed/
1718
│ ├── teds_a_2023_cleaned.csv
1819
│ ├── teds_analysis_ready.csv
19-
│ └── teds_ml_ready.csv
20-
├──sample/
20+
│ ├── teds_ml_ready.csv
21+
│ ├── teds_d_2023_cleaned.csv
22+
│ ├── teds_d_analysis_ready.csv
23+
│ └── teds_d_ml_ready.csv
24+
├── sample/
2125
│ └── tedsa_sample.csv
2226
└── README.md
27+
2328
```
2429

2530
---
@@ -41,7 +46,7 @@ File
4146
- **Coverage:** 50 U.S. states, DC, Puerto Rico (excluding Delaware, South
4247
Carolina, West Virginia)
4348
- **Format:** CSV with numeric codes for categorical variables
44-
- **Size:** ~800 MB
49+
- **Size:** ~300 MB
4550

4651
**Key Variables:**
4752

@@ -59,13 +64,34 @@ at first use
5964
- Data swapping applied to protect confidentiality
6065
- Some geographic detail suppressed for small populations
6166

67+
### `raw/tedsd_puf_2023.csv`
68+
69+
**Source:** SAMHSA (Substance Abuse and Mental Health Services Administration)
70+
**Dataset:** Treatment Episode Data Set - Discharges (TEDS-D), 2023 Public Use
71+
File
72+
**Download:** [SAMHSA TEDS Data](https://www.samhsa.gov/data/data-we-collect/teds-treatment-episode-data-set/datafiles)
73+
74+
**Characteristics:**
75+
76+
- **Records:** ~1,400,000 discharge records
77+
- **Variables:** ~50 variables
78+
- **Year:** 2023 discharges
79+
- **Coverage:** 50 U.S. states, DC, Puerto Rico
80+
- **Format:** CSV with numeric codes for categorical variables
81+
- **Size:** ~300 MB
82+
83+
**Key Variables:**
84+
85+
- Demographics, substance use, treatment, clinical, socioeconomic, geographic
86+
- Discharge-specific: variables with the `_D` suffix record patient status at discharge
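
A minimal sketch of loading a file in this format, assuming (as the Data Quality Summary indicates) that missing values are coded as `-9`. The in-memory excerpt and its column names are illustrative stand-ins, not taken from the codebook, and `load_teds` is a hypothetical helper:

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for raw/tedsd_puf_2023.csv.
# Column names are illustrative only.
sample_csv = io.StringIO(
    "CASEID,AGE,EMPLOY,EMPLOY_D\n"
    "1,4,1,2\n"
    "2,-9,3,-9\n"
    "3,7,-9,1\n"
)


def load_teds(source):
    """Read a TEDS-style CSV, treating the -9 code as missing."""
    return pd.read_csv(source, na_values=["-9"])


raw = load_teds(sample_csv)
# Count missing values per column after the -9 codes become NaN.
missing_per_column = raw.isna().sum()
```

For the real file, `load_teds("1_datasets/raw/tedsd_puf_2023.csv")` would be the equivalent call once the download is in place.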
87+
6288
---
6389

6490
## Processed Data
6591

6692
### `processed/teds_a_2023_cleaned.csv`
6793

68-
**Created by:** `2_data_preparation\cleaning_preprocessing.ipynb`
94+
**Created by:** `2_data_preparation/cleaning_teds_a.ipynb`
6995
**Purpose:** Fully cleaned and human-readable dataset for all analyses
7096

7197
**Processing Steps:**
@@ -81,7 +107,7 @@ at first use
81107
- **Records:** ~1,625,833 (all original records retained)
82108
- **Variables:** 50 variables
83109
- **Format:** CSV with human-readable text labels
84-
- **Size:** ~240 MB (70% reduction from raw)
110+
- **Size:** ~500 MB
85111
- **Missing Data:** Preserved as NaN for proper handling
86112

87113
**Variables:** See main project README or data dictionary for complete list
@@ -93,11 +119,25 @@ at first use
93119
- Visualization and reporting
94120
- General data inspection
95121

122+
### `processed/teds_d_2023_cleaned.csv`
123+
124+
**Created by:** `2_data_preparation/cleaning_teds_d.ipynb`
125+
**Purpose:** Fully cleaned TEDS-D dataset, human-readable
126+
127+
**Processing Steps:** Mirrors TEDS-A cleaning
128+
129+
**Characteristics:**
130+
131+
- **Records:** ~1,400,000
132+
- **Variables:** ~70 variables
133+
- **Format:** CSV with human-readable text labels
134+
- **Missing Data:** Preserved as NaN
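
The core of the cleaning step (numeric codes to text labels, with missing values kept as NaN) might look roughly like the sketch below. The label map and the `decode_column` helper are assumptions for illustration; the actual mappings live in the SAMHSA codebook and the cleaning notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical label map; real code-to-label assignments come from the codebook.
GENDER_LABELS = {1: "Male", 2: "Female"}


def decode_column(series, labels, missing_code=-9):
    """Map numeric codes to text labels, sending the missing code to NaN."""
    labels_with_missing = {**labels, missing_code: np.nan}
    return series.map(labels_with_missing)


coded = pd.Series([1, 2, -9, 1], name="GENDER")
decoded = decode_column(coded, GENDER_LABELS)
```

Note that `Series.map` also turns any code absent from the map into NaN, so an incomplete label map would silently hide values; checking for unexpected codes beforehand is a reasonable safeguard.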
135+
96136
---
97137

98138
### `processed/teds_analysis_ready.csv`
99139

100-
**Created by:** `2_data_preparation\missing_value_handling.ipynb`
140+
**Created by:** `2_data_preparation/missing_value_handling_teds_a.ipynb`
101141
**Purpose:** Optimized for statistical analysis with minimal data loss
102142

103143
**Processing Steps:**
@@ -131,11 +171,39 @@ at first use
131171
- Minimizes selection bias
132172
- Standard practice in epidemiological research
133173

174+
### `processed/teds_d_analysis_ready.csv`
175+
176+
**Created by:** `2_data_preparation/missing_value_handling_teds_d.ipynb`
177+
**Purpose:** Optimized for statistical analysis with minimal data loss
178+
179+
**Processing Steps:** Mirror those for `processed/teds_analysis_ready.csv` (TEDS-A)
180+
181+
**Characteristics:**
182+
183+
- **Records:** ~1,400,000 (95% retention)
184+
- **Variables:** ~70 variables
185+
- **Missing Data:** Present in non-critical variables
186+
- **Strategy:** Pairwise deletion (each test uses available data)
187+
188+
**Use Cases:**
189+
190+
- Statistical hypothesis testing
191+
- Correlation analysis
192+
- Chi-square tests
193+
- Group comparisons (t-tests, ANOVA, Mann-Whitney)
194+
- Regression analysis
195+
196+
**Advantages:**
197+
198+
- Maximizes statistical power
199+
- Minimizes selection bias
200+
- Standard practice in epidemiological research
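
Pairwise deletion, as described above, means each test drops only the rows that are incomplete for the variables it actually uses. A small sketch with made-up column names (`age`, `days_waited`, `prior_episodes` are illustrative, and `pairwise_complete` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 52],
    "days_waited": [3, np.nan, 10, 7, 1],
    "prior_episodes": [0, 2, 1, np.nan, 4],
})


def pairwise_complete(frame, columns):
    """Keep rows that are complete for the given columns only."""
    return frame.dropna(subset=columns)


# Each analysis keeps every row usable for its own pair of variables.
pair_ab = pairwise_complete(df, ["age", "days_waited"])
pair_ac = pairwise_complete(df, ["age", "prior_episodes"])

# Listwise deletion, for comparison, requires every column to be present.
listwise = df.dropna()
```

In this toy frame each pairwise subset keeps three rows while listwise deletion keeps two, which is the statistical-power argument in miniature.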
201+
134202
---
135203

136204
### `processed/teds_ml_ready.csv`
137205

138-
**Created by:** `2_data_preparation\missing_value_handling.ipynb`
206+
**Created by:** `2_data_preparation/missing_value_handling_teds_a.ipynb`
139207
**Purpose:** Imputed dataset ready for machine learning models
140208

141209
**Processing Steps:**
@@ -172,16 +240,52 @@ at first use
172240
- Document imputation methods in model cards
173241
- Consider multiple imputation for sensitivity analysis
174242

243+
### `processed/teds_d_ml_ready.csv`
244+
245+
**Created by:** `2_data_preparation/missing_value_handling_teds_d.ipynb`
246+
**Purpose:** Imputed dataset ready for machine learning models
247+
248+
**Processing Steps:** Mirror those for `processed/teds_ml_ready.csv` (TEDS-A)
249+
250+
1. **Binary variables:** Imputed with 0 (negative/absent)
251+
- All `is_*` and `has_*` flags
252+
253+
**Characteristics:**
254+
255+
- **Records:** ~1,400,000 (100% retention)
256+
- **Variables:** ~70 variables
257+
- **Missing Data:** Imputed (no NaN values)
258+
- **Strategy:** Statistical imputation
259+
260+
**Use Cases:**
261+
262+
- Machine learning model training
263+
- Predictive modeling
264+
- Classification and regression algorithms
265+
- Neural networks
266+
- Ensemble methods
267+
268+
**Important Notes:**
269+
270+
- Imputation may introduce bias
271+
- Use with caution for inferential statistics
272+
- Document imputation methods in model cards
273+
- Consider multiple imputation for sensitivity analysis
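
The binary-variable step listed above (fill `is_*` and `has_*` flags with 0) could be sketched as follows. The column names and the `impute_binary_flags` helper are hypothetical, and other variable types would need their own strategy, which this sketch does not cover:

```python
import numpy as np
import pandas as pd


def impute_binary_flags(frame, prefixes=("is_", "has_")):
    """Fill missing values with 0 in columns whose names start with a flag prefix."""
    out = frame.copy()
    flag_cols = [c for c in out.columns if c.startswith(prefixes)]
    out[flag_cols] = out[flag_cols].fillna(0).astype(int)
    return out


# Illustrative frame: two flag columns and one non-flag column.
df = pd.DataFrame({
    "is_veteran": [1, np.nan, 0],
    "has_prior_treatment": [np.nan, 1, np.nan],
    "age_group": ["25-29", None, "40-44"],
})
imputed = impute_binary_flags(df)
```

Filling with 0 treats "unknown" as "absent", which is one source of the imputation bias noted above.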
274+
175275
---
176276

177277
## Data Quality Summary
178278

179279
| Dataset | Records | Missing Data | Data Loss | Primary Use |
180280
|---------|---------|--------------|-----------|-------------|
181-
| **Raw** | 1,625,833 | Yes (-9 codes) | 0% | Source data |
182-
| **Cleaned** | 1,625,833 | Yes (NaN) | 0% | EDA, visualization |
183-
| **Analysis Ready** | ~1,540,000 | Yes (non-critical) | ~5% | Statistical test|
184-
| **ML Ready** | 1,625,833 | No (imputed) | 0% | Machine learning |
281+
| **Raw TEDS-A** | 1,625,833 | Yes (-9 codes) | 0% | Source data |
282+
| **Cleaned TEDS-A** | 1,625,833 | Yes (NaN) | 0% | EDA, visualization |
283+
| **Analysis Ready TEDS-A** | ~1,540,000 | Yes (non-critical) | ~5% | Statistical tests |
284+
| **ML Ready TEDS-A** | 1,625,833 | No (imputed) | 0% | Machine learning |
285+
| **Raw TEDS-D** | ~1,400,000 | Yes (-9 codes) | 0% | Source data |
286+
| **Cleaned TEDS-D** | ~1,400,000 | Yes (NaN) | 0% | EDA, visualization |
287+
| **Analysis Ready TEDS-D** | ~1,400,000 | Yes (non-critical) | ~5% | Statistical tests |
288+
| **ML Ready TEDS-D** | ~1,400,000 | No (imputed) | 0% | Machine learning |
185289

186290
---
187291

@@ -208,29 +312,32 @@ at first use
208312
## Data Processing Pipeline
209313

210314
```text
211-
Raw Data (tedsa_puf_2023.csv)
315+
Raw Data (tedsx_puf_2023.csv)
212316
213317
[Data Cleaning Pipeline]
214318
215-
Cleaned Data (teds_a_2023_cleaned.csv)
319+
Cleaned Data (teds_x_2023_cleaned.csv)
216320
217321
[Missing Value Strategy]
218322
219323
├── Analysis Ready (95% data, pairwise deletion)
220324
└── ML Ready (100% data, imputation)
221325
```
222326

223-
> **Note:** The TEDS-A dataset is not included in this repository due to size
327+
> **Note:** The TEDS-x datasets are not included in this repository due to size
224328
> and data governance considerations.
225329
> To reproduce the analyses, download both datasets directly from SAMHSA TEDS Data
226330
> and place them in the `1_datasets/raw/` folder.
331+
> Here, `x` stands for A or D.
227332
228333
### Reproducing Results
229334

230335
To reproduce cleaning and preprocessing:
231336

232-
1. Download `tedsa_puf_2023.csv`
337+
1. Download `tedsa_puf_2023.csv` and `tedsd_puf_2023.csv`
233338
2. Place them in `1_datasets/raw/`
234-
3. Run `2_data_preparation/cleaning_preprocessing.ipynb` and
235-
`2_data_preparation\missing_value_handling.ipynb`
339+
3. Run `2_data_preparation/cleaning_teds_a.ipynb`,
340+
`2_data_preparation/cleaning_teds_d.ipynb`
341+
and `2_data_preparation/missing_value_handling_teds_a.ipynb`,
342+
`2_data_preparation/missing_value_handling_teds_d.ipynb`.
236343
4. Output files will appear in `1_datasets/processed/`
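
Before running the notebooks, a quick check that both raw files are in place can save a failed run. This is a sketch only: `missing_raw_files` is a hypothetical helper, and the demonstration uses a temporary directory rather than the real repository:

```python
import tempfile
from pathlib import Path

RAW_FILES = ("tedsa_puf_2023.csv", "tedsd_puf_2023.csv")


def missing_raw_files(repo_root):
    """Return the expected raw files that are absent from 1_datasets/raw/."""
    raw_dir = Path(repo_root) / "1_datasets" / "raw"
    return [name for name in RAW_FILES if not (raw_dir / name).is_file()]


# Demonstration in a throwaway directory with only the TEDS-A file present.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "1_datasets" / "raw"
    raw_dir.mkdir(parents=True)
    (raw_dir / "tedsa_puf_2023.csv").touch()
    still_missing = missing_raw_files(tmp)
```

From the repository root, `missing_raw_files(".")` should return an empty list once both downloads are in `1_datasets/raw/`.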
