This repository contains the code for the experiments in our paper, "Rethinking Tokenization for Clinical Time Series: When Less is More." The codebase is adapted from the meds-torch library. We thank the original authors for their foundational work. For the maintained version, please see the official repository.
This work presents a systematic evaluation of tokenization approaches for clinical time series modeling. We compare Triplet and TextCode strategies across four prediction tasks on MIMIC-IV to investigate the roles of time, value, and code representations. Our findings suggest that for transformer-based models, tokenization can often be simplified without sacrificing performance.
| Component | Finding | Implication |
|---|---|---|
| Time Features | Explicit time encodings showed no statistically significant benefit. | Sequence order in transformers may be sufficient for the tasks studied. |
| Value Features | Importance is task-dependent (critical for mortality, less so for readmission). | Code sequences alone can carry significant predictive signal for some tasks. |
| Frozen Encoders | Tend to outperform trainable encoders while requiring far fewer trainable parameters. | Pretrained knowledge may serve as an effective regularized feature extractor. |
| Code Information | Appears to be the strongest predictive signal across the experiments studied. | Code representation quality may be a key driver of model performance. |
- `triplet_encoder_time2vec.py` - Time2Vec implementation for advanced time encoding
- `triplet_encoder_lete.py` - LeTE (Learnable Time Embeddings) implementation
- `triplet_encoder_code_only.py` - Code-only ablation (no time/value features)
- `triplet_encoder_no_time.py` - No-time ablation variant
- `triplet_encoder_no_value.py` - No-value ablation variant
- `textcode_encoder_flexible.py` - Flexible TextCode encoder with trainable/frozen modes
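For context, Time2Vec (Kazemi et al., 2019) embeds a scalar timestamp as one linear component plus learnable sinusoidal components. A minimal sketch of the transform in plain Python — function and parameter names here are illustrative, not the repository's actual API:

```python
import math


def time2vec(tau, w, b):
    """Map a scalar time tau to a k-dimensional vector.

    w, b: lists of k learned frequencies and phases (illustrative values
    below; in practice these are trained parameters).
    Component 0 is linear (w[0]*tau + b[0]); the rest are sinusoidal,
    letting the model capture both trends and periodic patterns in time.
    """
    out = [w[0] * tau + b[0]]
    out += [math.sin(w[i] * tau + b[i]) for i in range(1, len(w))]
    return out


# Example: a 4-dimensional embedding of time tau = 2.0
emb = time2vec(2.0, w=[0.5, 1.0, 2.0, 3.0], b=[0.0, 0.0, 0.0, 0.0])
```

The no-time ablation effectively asks whether such explicit encodings add anything beyond the transformer's positional ordering of tokens.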
- `experiment_baseline_multiseed.sh` - Baseline Triplet experiments
- `experiment_time2vec_multiseed.sh` - Time2Vec experiments
- `experiment_lete.sh` - LeTE experiments
- `experiment_code_only.sh` - Code-only ablation experiments
- `experiment_no_time.sh` - No-time ablation experiments
- `experiment_no_value.sh` - No-value ablation experiments
- `experiment_flexible_textcode.sh` - TextCode optimization experiments
- Dataset: MIMIC-IV processed into MEDS format
- Tasks: In-hospital mortality, ICU mortality, post-discharge mortality, 30-day readmission
- Framework: MEDS-Torch with transformer encoders
- Evaluation: AUROC with 10 random seeds, statistical significance testing
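The evaluation protocol above amounts to: train each variant with 10 random seeds, collect per-seed AUROCs, and compare variants with a paired test across seeds. A hedged sketch of that aggregation step, using stdlib only and illustrative scores (not results from the paper):

```python
import math
import statistics


def paired_t(a, b):
    """Paired t-statistic and degrees of freedom for matched per-seed scores."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return mean / (sd / math.sqrt(n)), n - 1


# Illustrative per-seed AUROCs for two variants (10 seeds each)
baseline = [0.861, 0.858, 0.864, 0.860, 0.859, 0.862, 0.857, 0.863, 0.861, 0.860]
variant = [0.862, 0.859, 0.863, 0.861, 0.858, 0.861, 0.858, 0.862, 0.860, 0.861]

t_stat, dof = paired_t(variant, baseline)
```

Pairing by seed controls for seed-to-seed variance; the resulting t-statistic would then be checked against a t-distribution with `dof` degrees of freedom for significance.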
This research suggests that simpler, more parameter-efficient tokenization approaches may achieve competitive performance in clinical time series modeling, raising questions about the necessity of complex temporal encodings and highlighting the task-dependent role of value features.
If you use this code or build on this work, please cite:
```bibtex
@misc{attrach2025ehrtokenization,
  title  = {Rethinking Tokenization for Clinical Time Series: When Less is More},
  author = {Al Attrach, Rafi and Fani, Rajna and Restrepo, David and Jia, Yugang
            and Celi, Leo Anthony and Sch\"{u}ffler, Peter},
  year   = {2025},
  note   = {Machine Learning for Health (ML4H) 2025 - Findings Track}
}
```
