---
sidebar_position: 4
---

# Load External Data

Load CSV, JSON, or Parquet data from any URL during strategy execution or as a pre-declared data source. Use this to bring in sentiment scores, earnings data, macro indicators, or any external dataset.

## Overview

The framework provides two ways to load external data:

1. **Data Sources** — Declare `DataSource.from_csv()`, `DataSource.from_json()`, or `DataSource.from_parquet()` in your strategy's `data_sources` list. Data is fetched automatically and made available in your `data` dict.
2. **Context methods** — Call `context.fetch_csv()`, `context.fetch_json()`, or `context.fetch_parquet()` on demand inside your strategy's `run_strategy` method.

Both approaches support caching, refresh intervals, date parsing, and pre/post-processing callbacks.

## Supported Formats

| Format | DataSource Factory | Context Method | Provider Class |
|--------|-------------------|----------------|----------------|
| CSV | `DataSource.from_csv()` | `context.fetch_csv()` | `CSVURLDataProvider` |
| JSON | `DataSource.from_json()` | `context.fetch_json()` | `JSONURLDataProvider` |
| Parquet | `DataSource.from_parquet()` | `context.fetch_parquet()` | `ParquetURLDataProvider` |

## Using DataSource (Pre-Declared)

Declare external data sources alongside your market data. The framework fetches them automatically before your strategy runs.

### CSV

```python
from investing_algorithm_framework import TradingStrategy, DataSource, TimeUnit

class SentimentStrategy(TradingStrategy):
    time_unit = TimeUnit.DAY
    interval = 1
    symbols = ["BTC"]

    data_sources = [
        DataSource.from_csv(
            identifier="sentiment",
            url="https://example.com/crypto_sentiment.csv",
            date_column="date",
            date_format="%Y-%m-%d",
            cache=True,
            refresh_interval="1d",
        ),
    ]

    def run_strategy(self, context, data):
        sentiment_df = data["sentiment"]
        latest_score = sentiment_df["score"][-1]

        if latest_score > 0.7:
            context.create_limit_order(...)
```

### JSON

```python
data_sources = [
    DataSource.from_json(
        identifier="earnings",
        url="https://api.example.com/earnings.json",
        date_column="report_date",
        date_format="%Y-%m-%d",
        refresh_interval="1d",
    ),
]
```

The JSON data must be either:
- An **array of objects** (records orientation): `[{"date": "2024-01-01", "value": 42}, ...]`
- An **object of arrays** (columnar orientation): `{"date": ["2024-01-01", ...], "value": [42, ...]}`
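
Both shapes parse to the same DataFrame. A quick polars illustration with made-up values (the framework loads external data into polars DataFrames):

```python
import polars as pl

# Records orientation: an array of objects.
records = [{"date": "2024-01-01", "value": 42}, {"date": "2024-01-02", "value": 43}]

# Columnar orientation: an object of arrays.
columnar = {"date": ["2024-01-01", "2024-01-02"], "value": [42, 43]}

# Both shapes yield the same two-column DataFrame.
assert pl.DataFrame(records).equals(pl.DataFrame(columnar))
```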

### Parquet

```python
data_sources = [
    DataSource.from_parquet(
        identifier="features",
        url="https://storage.example.com/features.parquet",
        date_column="date",
        refresh_interval="1W",
    ),
]
```

> **Note:** Parquet is a binary format, so `pre_process` callbacks are not supported. Use `post_process` instead.
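
Cleanup that would otherwise live in `pre_process` can usually move into `post_process`. A sketch (the null-dropping cleanup is a made-up example):

```python
import polars as pl

def drop_incomplete(df: pl.DataFrame) -> pl.DataFrame:
    """Remove rows with missing values after the Parquet file is parsed."""
    return df.drop_nulls()

data_sources = [
    DataSource.from_parquet(
        identifier="features",
        url="https://storage.example.com/features.parquet",
        date_column="date",
        post_process=drop_incomplete,
    ),
]
```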

## Using Context Methods (On-Demand)

Fetch data dynamically inside your strategy without pre-declaring it as a data source. This is useful when the URL depends on runtime values or when you only need data conditionally.

```python
class MyStrategy(TradingStrategy):
    time_unit = TimeUnit.DAY
    interval = 1
    symbols = ["BTC"]

    def run_strategy(self, context, data):
        # Fetch CSV on demand
        sentiment = context.fetch_csv(
            url="https://example.com/sentiment.csv",
            date_column="date",
            cache=True,
            refresh_interval="1d",
        )

        # Fetch JSON on demand
        earnings = context.fetch_json(
            url="https://api.example.com/earnings",
            date_column="report_date",
        )

        # Fetch Parquet on demand
        features = context.fetch_parquet(
            url="https://storage.example.com/features.parquet",
        )
```
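
Because these are ordinary method calls, the URL can be built from runtime values. A sketch, where the dated-file endpoint is a made-up example:

```python
from datetime import datetime, timezone

from investing_algorithm_framework import TradingStrategy, TimeUnit

class EarningsStrategy(TradingStrategy):
    time_unit = TimeUnit.DAY
    interval = 1
    symbols = ["BTC"]

    def run_strategy(self, context, data):
        today = datetime.now(timezone.utc).date()

        # Hypothetical endpoint that serves one file per day; the
        # framework only sees the final URL string.
        earnings = context.fetch_json(
            url=f"https://api.example.com/earnings/{today.isoformat()}",
            date_column="report_date",
        )
```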

## Parameters

All three factory methods and their context-method counterparts accept the same core parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `identifier` | `str` | — | Unique identifier (DataSource only). |
| `url` | `str` | — | URL to fetch the data from. |
| `date_column` | `str` | `None` | Name of the column containing dates. Parsed to `polars.Datetime`. |
| `date_format` | `str` | `None` | strftime format for parsing dates (e.g., `"%Y-%m-%d"`). Auto-detected if omitted. |
| `cache` | `bool` | `True` | Cache fetched data locally to avoid repeated downloads. |
| `refresh_interval` | `str` | `None` | How often to re-fetch: `"1m"`, `"5m"`, `"15m"`, `"30m"`, `"1h"`, `"4h"`, `"1d"`, `"1W"`. |
| `pre_process` | `callable` | `None` | Transform raw text before parsing. Receives `str`, returns `str`. Not available for Parquet. |
| `post_process` | `callable` | `None` | Transform the parsed DataFrame. Receives `DataFrame`, returns `DataFrame`. |
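
A single declaration exercising most of these parameters together (the URL and the null-dropping hook are illustrative):

```python
DataSource.from_csv(
    identifier="indicators",
    url="https://example.com/indicators.csv",
    date_column="date",
    date_format="%Y-%m-%d",
    cache=True,
    refresh_interval="4h",
    post_process=lambda df: df.drop_nulls(),  # drop rows with missing values
)
```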

## Pre/Post Processing

### Pre-Processing

Clean or transform the raw text before it is parsed into a DataFrame. Useful for removing comment lines, fixing delimiters, or filtering rows.

```python
def clean_csv(text):
    """Remove comment lines starting with #."""
    lines = [line for line in text.split("\n") if not line.startswith("#")]
    return "\n".join(lines)

DataSource.from_csv(
    identifier="cleaned_data",
    url="https://example.com/messy_data.csv",
    pre_process=clean_csv,
)
```
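
The same hook covers the other cleanups mentioned above, such as fixing delimiters. A sketch that naively converts a semicolon-delimited file to standard CSV (the URL is made up, and the replacement ignores quoted fields, so treat it as a starting point):

```python
def fix_delimiters(text):
    """Convert semicolon-delimited text to comma-separated CSV."""
    # Naive replacement: fine for simple files, wrong if any field
    # contains a quoted semicolon.
    return text.replace(";", ",")

DataSource.from_csv(
    identifier="eu_data",
    url="https://example.com/semicolon_separated.csv",
    pre_process=fix_delimiters,
)
```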

### Post-Processing

Transform the parsed DataFrame — add computed columns, filter rows, or change types.

```python
import polars as pl

def add_z_score(df):
    """Add a z-score column for the score field."""
    mean = df["score"].mean()
    std = df["score"].std()
    return df.with_columns(
        ((pl.col("score") - mean) / std).alias("z_score")
    )

DataSource.from_csv(
    identifier="scored_data",
    url="https://example.com/scores.csv",
    post_process=add_z_score,
)
```

## Caching

External data is cached both **in memory** and **on disk**:

- **In-memory cache** — Avoids re-parsing on every strategy tick.
- **File cache** — Stored inside your resource directory (e.g., `resources/data/`). Falls back to `.data_cache/` if no resource directory is configured.
- **Refresh interval** — When set, the framework re-fetches data after the specified interval expires. Staleness is determined by the cache file's modification time, so it survives process restarts.

Cache files are named using an MD5 hash of the URL, so different URLs never collide.
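
Put together, the cache behavior is roughly equivalent to the following sketch (the helper is illustrative, not the framework's actual internals):

```python
import hashlib
import time
from pathlib import Path

def cache_is_fresh(url: str, cache_dir: Path, refresh_seconds: float) -> bool:
    """Illustrative check: files are keyed by an MD5 hash of the URL,
    and staleness is judged from the file's modification time."""
    cache_file = cache_dir / hashlib.md5(url.encode()).hexdigest()
    if not cache_file.exists():
        return False
    return (time.time() - cache_file.stat().st_mtime) < refresh_seconds
```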

### Cloud Deployments (AWS Lambda / Azure Functions)

When a state handler (`AWSS3StorageStateHandler` or `AzureBlobStorageStateHandler`) is configured, cache files are persisted automatically because they live inside the resource directory that state handlers sync:

1. **Load state** — Cache files are restored from S3 / Azure Blob to `resources/data/`.
2. **Check interval** — Each file's modification time is compared against `refresh_interval`.
3. **Skip or re-fetch** — The URL is only fetched if the interval has elapsed.
4. **Save state** — Updated cache files are synced back to cloud storage.

This means `refresh_interval` works correctly across cold starts, with no extra configuration needed.

## Backtesting

External data sources work with backtesting. The framework calls `prepare_backtest_data()` before the backtest starts and `get_backtest_data()` on each tick. If a `date_column` is configured, the data is automatically filtered to only include rows up to the current backtest date.

```python
data_sources = [
    DataSource.from_csv(
        identifier="macro",
        url="https://example.com/macro_indicators.csv",
        date_column="date",
        date_format="%Y-%m-%d",
    ),
]
```

During backtesting, `data["macro"]` will only contain rows where `date <= current_backtest_date`, giving you realistic point-in-time data.
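
The filter is equivalent to a simple polars expression. A sketch, where `current_backtest_date` stands in for the framework's internal backtest clock and the data is made up:

```python
import polars as pl
from datetime import datetime

macro_df = pl.DataFrame({
    "date": [datetime(2024, 5, 30), datetime(2024, 6, 2)],
    "cpi": [3.1, 3.0],
})

current_backtest_date = datetime(2024, 6, 1)

# Rows after the backtest clock are dropped, so the strategy never
# sees future data; only the 2024-05-30 row survives here.
point_in_time = macro_df.filter(pl.col("date") <= current_backtest_date)
```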

## Next Steps

- Learn about [Data Sources](data-sources) for market data configuration
- Explore [Custom Data Providers](../Advanced%20Concepts/custom-data-providers) to build your own provider
- Check out [Strategies](../Getting%20Started/strategies) for strategy implementation