Commit ac53284

Jaybee Huang authored and committed
docs: add tushare collector usage and overview entry
1 parent 179bc6b commit ac53284

2 files changed

Lines changed: 72 additions & 0 deletions

scripts/data_collector/README.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
Scripts for data collection

- yahoo: get *US/CN* stock data from *Yahoo Finance*
- tushare: get *CN* daily stock data from *TuShare* with download/normalize/dump pipeline and incremental update
- fund: get fund data from *http://fund.eastmoney.com*
- cn_index: get *CN index* from *http://www.csindex.com.cn*, *CSI300*/*CSI100*
- us_index: get *US index* from *https://en.wikipedia.org/wiki*, *SP500*/*NASDAQ100*/*DJIA*/*SP400*
Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
# TuShare Daily Data Collector

Collect CN daily equity data from TuShare, normalize it to the qlib CSV schema, and dump it to qlib bin format. Supports full builds and incremental updates.

## Requirements
- A Python virtual environment is recommended. Install the dependencies:
```bash
python -m pip install tushare plotly torch
```
- Set `TUSHARE_TOKEN` (e.g., put it in `.env`, then run `export $(cat .env | xargs)`).
- Default qlib output: `~/.qlib/qlib_data/cn_data`.

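The token requirement can also be checked from Python before starting a long download. The following is a minimal sketch, not part of the collector: `parse_dotenv` and `resolve_token` are hypothetical helper names, and the `.env` format is assumed to be plain `KEY=VALUE` lines.

```python
import os
from pathlib import Path


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def resolve_token(env=None, dotenv_path=".env"):
    """Return TUSHARE_TOKEN from the environment, falling back to a .env file."""
    env = os.environ if env is None else env
    token = env.get("TUSHARE_TOKEN")
    if not token and Path(dotenv_path).is_file():
        text = Path(dotenv_path).read_text(encoding="utf-8")
        token = parse_dotenv(text).get("TUSHARE_TOKEN")
    if not token:
        raise RuntimeError("TUSHARE_TOKEN is not set; export it or add it to .env")
    return token
```

Failing early with a clear message is usually cheaper than discovering a missing token partway through a multi-year download.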
## Quick Start (one-shot pipeline)
Download → normalize → dump in a single command:
```bash
python qlib/scripts/data_collector/tushare/collector.py pipeline \
  --source_dir ./tmp/tushare_raw \
  --normalize_dir ./tmp/tushare_norm \
  --qlib_dir ~/.qlib/qlib_data/cn_data \
  --start 2010-01-01 --end 2024-12-31 \
  --token "$TUSHARE_TOKEN"
```

## Step-by-Step
1) Download raw TuShare daily data to CSV:
```bash
python qlib/scripts/data_collector/tushare/collector.py download_data \
  --source_dir ./tmp/tushare_raw \
  --start 2020-01-01 --end 2020-12-31 \
  --token "$TUSHARE_TOKEN"
```
2) Normalize to qlib-ready CSVs (factor-adjusted prices, volume back-adjusted, symbols normalized):
```bash
python qlib/scripts/data_collector/tushare/collector.py normalize_data \
  --source_dir ./tmp/tushare_raw \
  --normalize_dir ./tmp/tushare_norm
```
3) Dump normalized CSVs to qlib bin format:
```bash
python qlib/scripts/data_collector/tushare/collector.py dump_to_bin \
  --normalize_dir ./tmp/tushare_norm \
  --qlib_dir ~/.qlib/qlib_data/cn_data \
  --mode all
```

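The three steps share directories: the download output feeds normalization, and the normalized output feeds the dump. A rough sketch of chaining them from Python follows. It only assembles the commands shown above; `build_steps` and `run_steps` are hypothetical helper names, not part of the collector.

```python
import os
import subprocess
import sys

# Script path exactly as written in the commands above.
COLLECTOR = "qlib/scripts/data_collector/tushare/collector.py"


def build_steps(source_dir, normalize_dir, qlib_dir, start, end, token):
    """Return the argv lists for download, normalize, and dump, in order."""
    # No shell is involved, so "~" must be expanded explicitly.
    qlib_dir = os.path.expanduser(qlib_dir)
    base = [sys.executable, COLLECTOR]
    return [
        base + ["download_data", "--source_dir", source_dir,
                "--start", start, "--end", end, "--token", token],
        base + ["normalize_data", "--source_dir", source_dir,
                "--normalize_dir", normalize_dir],
        base + ["dump_to_bin", "--normalize_dir", normalize_dir,
                "--qlib_dir", qlib_dir, "--mode", "all"],
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first non-zero exit code."""
    for argv in steps:
        subprocess.run(argv, check=True)
```

Building the argument lists in one place keeps `--source_dir` and `--normalize_dir` consistent between stages, which is the easiest thing to get wrong when running the steps by hand.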
## Incremental Update
Update an existing day-level qlib directory with fresh TuShare data:
```bash
python qlib/scripts/data_collector/tushare/collector.py update_data_to_bin \
  --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data \
  --end_date 2024-12-31
```
- Starts from the last trading date in `calendars/day.txt` and only dumps newer rows.
- Reruns `download_data` + `normalize_data` internally and writes incremental bins.

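The "last trading date" rule can be illustrated as follows. This is a sketch of the idea only, not the collector's code: it assumes `calendars/day.txt` holds one `YYYY-MM-DD` date per line (the usual qlib layout), and `last_calendar_date` and `newer_rows` are hypothetical names.

```python
import tempfile
from datetime import date
from pathlib import Path


def last_calendar_date(qlib_dir):
    """Return the final trading date listed in calendars/day.txt."""
    cal = Path(qlib_dir).expanduser() / "calendars" / "day.txt"
    lines = [ln.strip() for ln in cal.read_text().splitlines() if ln.strip()]
    return date.fromisoformat(lines[-1])


def newer_rows(rows, last):
    """Keep only rows dated strictly after the last calendar date."""
    return [r for r in rows if date.fromisoformat(r["date"]) > last]


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "calendars").mkdir()
    (Path(tmp) / "calendars" / "day.txt").write_text("2024-12-27\n2024-12-30\n")
    last = last_calendar_date(tmp)
    fresh = newer_rows([{"date": "2024-12-30"}, {"date": "2024-12-31"}], last)
print(last, fresh)
```

The strict comparison matters: a row dated on the last calendar day is already in the bins, so including it would write that day twice.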
## Validate a qlib Directory
```bash
python - <<'PY'
from qlib.scripts.data_collector.tushare.collector import validate_qlib_dir
print(validate_qlib_dir("~/.qlib/qlib_data/cn_data", freq="day"))
PY
```
Returns a dict whose values are all `None` when calendars, instruments, and feature bins are present.

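For intuition, a check of this kind might look roughly like the sketch below. It is not the implementation of `validate_qlib_dir`: the function name `check_qlib_dir`, the dict keys, and the messages are all hypothetical, and it assumes the usual qlib layout of `calendars/<freq>.txt`, `instruments/*.txt`, and `features/<symbol>/<field>.<freq>.bin`.

```python
import tempfile
from pathlib import Path


def check_qlib_dir(qlib_dir, freq="day"):
    """Map each component to None when present, else to a short problem message."""
    root = Path(qlib_dir).expanduser()
    report = {"calendars": None, "instruments": None, "features": None}
    if not (root / "calendars" / f"{freq}.txt").is_file():
        report["calendars"] = f"missing calendars/{freq}.txt"
    inst = root / "instruments"
    if not inst.is_dir() or not any(inst.glob("*.txt")):
        report["instruments"] = "no instruments/*.txt files"
    feats = root / "features"
    if not feats.is_dir() or not any(feats.glob(f"*/*.{freq}.bin")):
        report["features"] = f"no features/*/*.{freq}.bin files"
    return report


# Demonstration: an empty directory fails every check, a populated one passes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    empty_report = check_qlib_dir(root)
    (root / "calendars").mkdir()
    (root / "calendars" / "day.txt").write_text("2024-12-30\n")
    (root / "instruments").mkdir()
    (root / "instruments" / "all.txt").write_text("sz000001\t2010-01-04\t2024-12-30\n")
    (root / "features" / "sz000001").mkdir(parents=True)
    (root / "features" / "sz000001" / "close.day.bin").write_bytes(b"")
    full_report = check_qlib_dir(root)
```

Using `None` for "no problem" lets a caller test the whole directory with `all(v is None for v in report.values())`.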
## Notes
- Interval: currently 1d only.
- Required columns fetched: `ts_code`, `trade_date`, `open/high/low/close`, `vol`, `amount`, `adj_factor`.
- Prices are forward-adjusted by the normalized `factor`; volume is back-adjusted by the same factor.
- Symbols are mapped from `000001.SZ` → `sz000001` to match qlib conventions.
- `save_instrument` deduplicates by date so reruns will not create duplicate rows.
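The normalization rules in these notes can be sketched in a few lines. This is an illustration under stated assumptions, not the collector's code: it assumes the normalized factor is `adj_factor` divided by the most recent `adj_factor` (the usual forward-adjustment convention), that prices are multiplied by that factor while volume is divided by it, and that `trade_date` is a sortable `YYYYMMDD` string. All function names are hypothetical.

```python
def to_qlib_symbol(ts_code):
    """Map a TuShare code such as '000001.SZ' to 'sz000001'."""
    code, _, exchange = ts_code.partition(".")
    if not code or not exchange:
        raise ValueError(f"unexpected ts_code: {ts_code!r}")
    return f"{exchange.lower()}{code}"


def adjust_rows(rows):
    """Adjust close and vol; rows must be sorted by trade_date ascending."""
    latest = rows[-1]["adj_factor"]
    adjusted = []
    for row in rows:
        factor = row["adj_factor"] / latest  # equals 1.0 on the latest day
        adjusted.append({
            **row,
            "factor": factor,
            "close": row["close"] * factor,  # price scaled by the factor
            "vol": row["vol"] / factor,      # volume scaled inversely
        })
    return adjusted


def dedupe_by_date(rows):
    """Keep one row per trade_date (the last one seen), sorted by date."""
    by_date = {}
    for row in rows:
        by_date[row["trade_date"]] = row
    return [by_date[d] for d in sorted(by_date)]
```

Scaling price and volume in opposite directions keeps the traded value (price times volume) unchanged by the adjustment.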
