| 1 | +# TuShare Daily Data Collector |
| 2 | + |
| 3 | +Collect CN daily equity data from TuShare, normalize to qlib CSV schema, and dump to qlib bin format. Supports full builds and incremental updates. |
| 4 | + |
| 5 | +## Requirements |
| 6 | +- Python venv is recommended; install dependencies: |
| 7 | + ```bash |
| 8 | + python -m pip install tushare plotly torch |
| 9 | + ``` |
| 10 | +- Set the `TUSHARE_TOKEN` environment variable (e.g., put it in `.env`, then run `export $(cat .env | xargs)`). |
| 11 | +- Default qlib output: `~/.qlib/qlib_data/cn_data`. |
| 12 | + |
| 13 | +## Quick Start (one-shot pipeline) |
| 14 | +Download → normalize → dump in a single command: |
| 15 | +```bash |
| 16 | +python qlib/scripts/data_collector/tushare/collector.py pipeline \ |
| 17 | + --source_dir ./tmp/tushare_raw \ |
| 18 | + --normalize_dir ./tmp/tushare_norm \ |
| 19 | + --qlib_dir ~/.qlib/qlib_data/cn_data \ |
| 20 | + --start 2010-01-01 --end 2024-12-31 \ |
| 21 | + --token "$TUSHARE_TOKEN" |
| 22 | +``` |
| 23 | + |
| 24 | +## Step-by-Step |
| 25 | +1) Download raw TuShare daily data to CSV: |
| 26 | +```bash |
| 27 | +python qlib/scripts/data_collector/tushare/collector.py download_data \ |
| 28 | + --source_dir ./tmp/tushare_raw \ |
| 29 | + --start 2020-01-01 --end 2020-12-31 \ |
| 30 | + --token "$TUSHARE_TOKEN" |
| 31 | +``` |
| 32 | +2) Normalize to qlib-ready CSVs (factor-adjusted prices, volume back-adjusted, symbols normalized): |
| 33 | +```bash |
| 34 | +python qlib/scripts/data_collector/tushare/collector.py normalize_data \ |
| 35 | + --source_dir ./tmp/tushare_raw \ |
| 36 | + --normalize_dir ./tmp/tushare_norm |
| 37 | +``` |
| 38 | +3) Dump normalized CSVs to qlib bin format: |
| 39 | +```bash |
| 40 | +python qlib/scripts/data_collector/tushare/collector.py dump_to_bin \ |
| 41 | + --normalize_dir ./tmp/tushare_norm \ |
| 42 | + --qlib_dir ~/.qlib/qlib_data/cn_data \ |
| 43 | + --mode all |
| 44 | +``` |
| 45 | + |
| 46 | +## Incremental Update |
| 47 | +Update an existing day-level qlib directory with fresh TuShare data: |
| 48 | +```bash |
| 49 | +python qlib/scripts/data_collector/tushare/collector.py update_data_to_bin \ |
| 50 | + --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data \ |
| 51 | + --end_date 2024-12-31 |
| 52 | +``` |
| 53 | +- Starts from the last trading date in `calendars/day.txt` and only dumps newer rows. |
| 54 | +- Reruns `download_data` + `normalize_data` internally and writes incremental bins. |
| 55 | + |
| 56 | +## Validate a qlib Directory |
| 57 | +```bash |
| 58 | +python - <<'PY' |
| 59 | +from qlib.scripts.data_collector.tushare.collector import validate_qlib_dir |
| 60 | +print(validate_qlib_dir("~/.qlib/qlib_data/cn_data", freq="day")) |
| 61 | +PY |
| 62 | +``` |
| 63 | +Returns a dict; each value is `None` when the corresponding component (calendars, instruments, feature bins) is present. |
| 64 | + |
| 65 | +## Notes |
| 66 | +- Interval: currently 1d only. |
| 67 | +- Required columns fetched: `ts_code`, `trade_date`, `open`, `high`, `low`, `close`, `vol`, `amount`, `adj_factor`. |
| 68 | +- Prices are forward-adjusted by the normalized `factor`; volume is back-adjusted by the same factor. |
| 69 | +- Symbols are mapped from `000001.SZ` → `sz000001` to match qlib conventions. |
| 70 | +- `save_instrument` deduplicates by date so reruns will not create duplicate rows. |
| 71 | + |
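The symbol mapping described in the Notes (`000001.SZ` → `sz000001`) can be illustrated with a minimal sketch. The helper name `to_qlib_symbol` is hypothetical and is not taken from the collector; it only assumes that TuShare codes have the form `<code>.<exchange>` and that the qlib symbol is the lowercased exchange followed by the code.

```python
def to_qlib_symbol(ts_code: str) -> str:
    """Convert a TuShare code such as '000001.SZ' to a qlib-style symbol.

    Hypothetical helper for illustration; the collector's real
    implementation may differ.
    """
    # Split on the first dot: '000001.SZ' -> ('000001', '.', 'SZ')
    code, sep, exchange = ts_code.strip().partition(".")
    if not sep or not code or not exchange:
        raise ValueError(f"expected '<code>.<exchange>', got {ts_code!r}")
    # Lowercase exchange prefix followed by the numeric code
    return f"{exchange.lower()}{code}"


if __name__ == "__main__":
    for raw in ("000001.SZ", "600000.SH"):
        print(raw, "->", to_qlib_symbol(raw))
```

Raising on malformed input, rather than passing it through, keeps unexpected codes from silently producing mismatched file names in the normalized output.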