Commit 2b635df

feat: add CSV, JSON, and Parquet URL data providers
- Add BaseURLDataProvider with shared caching, refresh intervals, date parsing, and pre/post-processing logic
- Add CSVURLDataProvider, JSONURLDataProvider, ParquetURLDataProvider
- Add DataSource.from_csv(), from_json(), from_parquet() factory methods
- Add context.fetch_csv(), fetch_json(), fetch_parquet() on-demand methods
- Fix cache path resolution for cloud deployments (AWS Lambda / Azure Functions)
- Add 43 tests covering all three providers
- Add external-data.md documentation page
- Update README with external data capabilities

Closes #458
1 parent 6d564f3 commit 2b635df

15 files changed

Lines changed: 1840 additions & 13 deletions

README.md

Lines changed: 11 additions & 8 deletions
```diff
@@ -76,6 +76,7 @@ This framework is built around the full loop: **create strategies → backtest t
 - 🎯 **Return Scenario Projections** — Good, average, bad & very bad year projections from backtest data
 - 📉 **Benchmark Comparison** — Beat-rate analysis vs Buy & Hold, DCA, risk-free & custom benchmarks
 - 📄 **One-Click HTML Report** — Self-contained file, no server, dark & light theme, shareable
+- 🌐 **Load External Data** — Fetch CSV, JSON, or Parquet from any URL with caching and auto-refresh
 - 🚀 **Build → Backtest → Deploy** — Local dev, cloud deploy (AWS / Azure), or monetize on Finterion
 
 </details>
@@ -262,15 +263,17 @@ report.save("my_report.html")
 
 | | |
 |---|---|
-| **Backtest Report Dashboard** | Self-contained HTML report with ranking tables, equity curves, metric charts, heatmaps, and strategy comparison |
-| **Event-Driven Backtesting** | Realistic, order-by-order simulation |
-| **Vectorized Backtesting** | Fast signal research and prototyping |
+| **[Backtest Report Dashboard](https://coding-kitties.github.io/investing-algorithm-framework/Getting%20Started/backtest-reports)** | Self-contained HTML report with ranking tables, equity curves, metric charts, heatmaps, and strategy comparison |
+| **[Event-Driven Backtesting](https://coding-kitties.github.io/investing-algorithm-framework/Getting%20Started/backtesting)** | Realistic, order-by-order simulation |
+| **[Vectorized Backtesting](https://coding-kitties.github.io/investing-algorithm-framework/Advanced%20Concepts/vector-backtesting)** | Fast signal research and prototyping |
 | **50+ Metrics** | CAGR, Sharpe, Sortino, max drawdown, win rate, profit factor, recovery factor, volatility, and more |
-| **Live Trading** | Connect to exchanges via CCXT for real-time execution |
-| **Portfolio Management** | Position tracking, trade management, persistence |
-| **Cloud Deployment** | Deploy to AWS Lambda, Azure Functions, or run as a web service |
-| **Market Data** | OHLCV, tickers, custom data — Polars and Pandas native |
-| **Extensible** | Custom data providers, order executors, and strategy classes |
+| **[Live Trading](https://coding-kitties.github.io/investing-algorithm-framework/Getting%20Started/application-setup)** | Connect to exchanges via CCXT for real-time execution |
+| **[Portfolio Management](https://coding-kitties.github.io/investing-algorithm-framework/Getting%20Started/portfolio-configuration)** | Position tracking, trade management, persistence |
+| **[Cloud Deployment](https://coding-kitties.github.io/investing-algorithm-framework/Getting%20Started/deployment)** | Deploy to AWS Lambda, Azure Functions, or run as a web service |
+| **[Market Data Providers](https://coding-kitties.github.io/investing-algorithm-framework/Advanced%20Concepts/custom-data-providers)** | Built-in providers for CCXT, Yahoo Finance, Alpha Vantage, and Polygon — or build your own |
+| **[Load External Data](https://coding-kitties.github.io/investing-algorithm-framework/Data/external-data)** | Fetch CSV, JSON, or Parquet from any URL with caching, date parsing, and pre/post-processing |
+| **[Strategies](https://coding-kitties.github.io/investing-algorithm-framework/Getting%20Started/strategies)** | OHLCV, tickers, custom data — Polars and Pandas native |
+| **[Extensible](https://coding-kitties.github.io/investing-algorithm-framework/Advanced%20Concepts/custom-data-providers)** | Custom data providers, order executors, and strategy classes |
 
 </details>
```

external-data.md

Lines changed: 221 additions & 0 deletions

The new documentation page (added in full by this commit):
---
sidebar_position: 4
---

# Load External Data

Load CSV, JSON, or Parquet data from any URL during strategy execution or as a pre-declared data source. Use this to bring in sentiment scores, earnings data, macro indicators, or any external dataset.

## Overview

The framework provides two ways to load external data:

1. **Data Sources** — Declare `DataSource.from_csv()`, `DataSource.from_json()`, or `DataSource.from_parquet()` in your strategy's `data_sources` list. Data is fetched automatically and available in your `data` dict.
2. **Context methods** — Call `context.fetch_csv()`, `context.fetch_json()`, or `context.fetch_parquet()` on demand inside your strategy's `run_strategy` method.

Both approaches support caching, refresh intervals, date parsing, and pre/post-processing callbacks.

## Supported Formats

| Format | DataSource Factory | Context Method | Provider Class |
|--------|-------------------|----------------|----------------|
| CSV | `DataSource.from_csv()` | `context.fetch_csv()` | `CSVURLDataProvider` |
| JSON | `DataSource.from_json()` | `context.fetch_json()` | `JSONURLDataProvider` |
| Parquet | `DataSource.from_parquet()` | `context.fetch_parquet()` | `ParquetURLDataProvider` |
## Using DataSource (Pre-Declared)

Declare external data sources alongside your market data. The framework fetches them automatically before your strategy runs.

### CSV

```python
from investing_algorithm_framework import TradingStrategy, DataSource, TimeUnit


class SentimentStrategy(TradingStrategy):
    time_unit = TimeUnit.DAY
    interval = 1
    symbols = ["BTC"]

    data_sources = [
        DataSource.from_csv(
            identifier="sentiment",
            url="https://example.com/crypto_sentiment.csv",
            date_column="date",
            date_format="%Y-%m-%d",
            cache=True,
            refresh_interval="1d",
        ),
    ]

    def run_strategy(self, context, data):
        sentiment_df = data["sentiment"]
        latest_score = sentiment_df["score"][-1]

        if latest_score > 0.7:
            context.create_limit_order(...)
```
### JSON

```python
data_sources = [
    DataSource.from_json(
        identifier="earnings",
        url="https://api.example.com/earnings.json",
        date_column="report_date",
        date_format="%Y-%m-%d",
        refresh_interval="1d",
    ),
]
```

The JSON data must be either:

- An **array of objects** (records orientation): `[{"date": "2024-01-01", "value": 42}, ...]`
- An **object of arrays** (columnar orientation): `{"date": ["2024-01-01", ...], "value": [42, ...]}`
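To make the two accepted shapes concrete, here is a stdlib-only sketch. The framework itself parses either shape into a Polars DataFrame; this snippet only demonstrates that the two shapes carry identical data:

```python
import json

# Records orientation: a list of row objects.
records = json.loads(
    '[{"date": "2024-01-01", "value": 42}, {"date": "2024-01-02", "value": 43}]'
)

# Columnar orientation: one list per column.
columnar = json.loads(
    '{"date": ["2024-01-01", "2024-01-02"], "value": [42, 43]}'
)

# Normalizing records into columnar form yields the same structure.
as_columnar = {key: [row[key] for row in records] for key in records[0]}
```

Either shape is fine to serve from your API; pick whichever your backend produces naturally.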
### Parquet

```python
data_sources = [
    DataSource.from_parquet(
        identifier="features",
        url="https://storage.example.com/features.parquet",
        date_column="date",
        refresh_interval="1W",
    ),
]
```

> **Note:** Parquet is a binary format, so `pre_process` callbacks are not supported. Use `post_process` instead.
## Using Context Methods (On-Demand)

Fetch data dynamically inside your strategy without pre-declaring it as a data source. This is useful when the URL depends on runtime values or when you only need data conditionally.

```python
class MyStrategy(TradingStrategy):
    time_unit = TimeUnit.DAY
    interval = 1
    symbols = ["BTC"]

    def run_strategy(self, context, data):
        # Fetch CSV on demand
        sentiment = context.fetch_csv(
            url="https://example.com/sentiment.csv",
            date_column="date",
            cache=True,
            refresh_interval="1d",
        )

        # Fetch JSON on demand
        earnings = context.fetch_json(
            url="https://api.example.com/earnings",
            date_column="report_date",
        )

        # Fetch Parquet on demand
        features = context.fetch_parquet(
            url="https://storage.example.com/features.parquet",
        )
```
## Parameters

All three factory methods and context methods accept the same core parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `identifier` | `str` | required | Unique identifier (DataSource only). |
| `url` | `str` | required | URL to fetch the data from. |
| `date_column` | `str` | `None` | Name of the column containing dates. Parsed to `polars.Datetime`. |
| `date_format` | `str` | `None` | strftime format for parsing dates (e.g., `"%Y-%m-%d"`). Auto-detected if omitted. |
| `cache` | `bool` | `True` | Cache fetched data locally to avoid repeated downloads. |
| `refresh_interval` | `str` | `None` | How often to re-fetch: `"1m"`, `"5m"`, `"15m"`, `"30m"`, `"1h"`, `"4h"`, `"1d"`, `"1W"`. |
| `pre_process` | `callable` | `None` | Transform raw text before parsing. Receives `str`, returns `str`. Not available for Parquet. |
| `post_process` | `callable` | `None` | Transform the parsed DataFrame. Receives a `DataFrame`, returns a `DataFrame`. |
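For intuition, the documented `refresh_interval` strings map cleanly onto durations. The helper below is a hypothetical sketch, not the framework's actual parsing code:

```python
from datetime import timedelta

# Hypothetical mapping from the documented unit suffixes to seconds.
_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400, "W": 604800}


def parse_refresh_interval(interval: str) -> timedelta:
    """Parse strings such as '15m', '4h', '1d', or '1W' into a timedelta."""
    value, unit = interval[:-1], interval[-1]
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"Unsupported interval unit: {unit!r}")
    return timedelta(seconds=int(value) * _UNIT_SECONDS[unit])
```

The real provider may represent intervals differently internally; only the accepted strings in the table above are part of the documented contract.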
## Pre/Post Processing

### Pre-Processing

Clean or transform raw text before it is parsed into a DataFrame. Useful for removing comment lines, fixing delimiters, or filtering rows.

```python
def clean_csv(text):
    """Remove comment lines starting with #."""
    lines = [line for line in text.split("\n") if not line.startswith("#")]
    return "\n".join(lines)


DataSource.from_csv(
    identifier="cleaned_data",
    url="https://example.com/messy_data.csv",
    pre_process=clean_csv,
)
```

### Post-Processing

Transform the parsed DataFrame — add computed columns, filter rows, change types.

```python
import polars as pl


def add_z_score(df):
    """Add a z-score column for the score field."""
    mean = df["score"].mean()
    std = df["score"].std()
    return df.with_columns(
        ((pl.col("score") - mean) / std).alias("z_score")
    )


DataSource.from_csv(
    identifier="scored_data",
    url="https://example.com/scores.csv",
    post_process=add_z_score,
)
```
## Caching

External data is cached both **in memory** and **on disk**:

- **In-memory cache** — Avoids re-parsing on every strategy tick.
- **File cache** — Stored inside your resource directory (e.g., `resources/data/`). Falls back to `.data_cache/` if no resource directory is configured.
- **Refresh interval** — When set, the framework re-fetches data after the specified interval expires. Staleness is determined by the cache file's modification time, so it survives process restarts.

Cache files are named using an MD5 hash of the URL, so different URLs never collide.
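The scheme described above can be sketched in a few lines. Names here are illustrative, not the framework's internals: the key ideas are that the cache file name derives from an MD5 hash of the URL and that staleness is judged from the file's modification time.

```python
import hashlib
import time
from pathlib import Path


def cache_path_for(url: str, cache_dir: str = ".data_cache") -> Path:
    # Hash the URL so distinct URLs always get distinct file names.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / digest


def is_stale(path: Path, refresh_seconds: float) -> bool:
    # A missing file is always stale; otherwise compare mtime age
    # against the refresh interval.
    if not path.exists():
        return True
    return (time.time() - path.stat().st_mtime) > refresh_seconds
```

Because the staleness check reads the file's mtime rather than in-process state, it keeps working after a restart, which is exactly what the cloud-deployment flow below relies on.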
### Cloud Deployments (AWS Lambda / Azure Functions)

When a state handler (`AWSS3StorageStateHandler` or `AzureBlobStorageStateHandler`) is configured, cache files are persisted automatically because they live inside the resource directory that state handlers sync:

1. **Load state** — Cache files are restored from S3 / Azure Blob to `resources/data/`
2. **Check interval** — The file modification time is compared against `refresh_interval`
3. **Skip or re-fetch** — Only fetches from the URL if the interval has elapsed
4. **Save state** — Updated cache files are synced back to cloud storage

This means `refresh_interval` works correctly across cold starts — no extra configuration needed.
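The four steps above can be sketched as a single invocation cycle. This is a hypothetical outline: `load_state`, `fetch`, and `save_state` stand in for what the real state handlers and data providers do.

```python
import time
from pathlib import Path


def run_invocation(cache_file: Path, refresh_seconds: float,
                   load_state, fetch, save_state):
    load_state()                         # 1. restore cache files from cloud storage
    stale = (
        not cache_file.exists()
        or time.time() - cache_file.stat().st_mtime > refresh_seconds
    )                                    # 2. compare mtime against the interval
    if stale:                            # 3. re-fetch only if the interval elapsed
        cache_file.write_bytes(fetch())
    save_state()                         # 4. sync updated cache back to cloud storage
```

On a cold start the restored file keeps its original mtime, so a fresh cache skips the network fetch entirely.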
## Backtesting

External data sources work with backtesting. The framework calls `prepare_backtest_data()` before the backtest starts, and `get_backtest_data()` on each tick. If a `date_column` is configured, data is automatically filtered to include only rows up to the current backtest date.

```python
data_sources = [
    DataSource.from_csv(
        identifier="macro",
        url="https://example.com/macro_indicators.csv",
        date_column="date",
        date_format="%Y-%m-%d",
    ),
]
```

During backtesting, `data["macro"]` will only contain rows where `date <= current_backtest_date`, giving you realistic point-in-time data.
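The framework applies this filter with Polars; the stdlib sketch below illustrates the same point-in-time rule, where the strategy only ever sees rows dated at or before the current backtest date:

```python
from datetime import date

rows = [
    {"date": date(2024, 1, 1), "value": 1.0},
    {"date": date(2024, 1, 2), "value": 2.0},
    {"date": date(2024, 1, 3), "value": 3.0},
]

current_backtest_date = date(2024, 1, 2)

# Keep only rows the strategy is allowed to see at this tick.
visible = [row for row in rows if row["date"] <= current_backtest_date]
```

This guards against look-ahead bias: a row published after the simulated "now" can never influence a backtested decision.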
## Next Steps

- Learn about [Data Sources](data-sources) for market data configuration
- Explore [Custom Data Providers](../Advanced%20Concepts/custom-data-providers) to build your own provider
- Check out [Strategies](../Getting%20Started/strategies) for strategy implementation

docusaurus/sidebars.js

Lines changed: 4 additions & 0 deletions
```diff
@@ -67,6 +67,10 @@ const sidebars = {
       type: 'doc',
       id: 'Data/multiple-market-data-sources',
     },
+    {
+      type: 'doc',
+      id: 'Data/external-data',
+    },
   ],
 },
 {
```

investing_algorithm_framework/__init__.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -26,7 +26,8 @@
     save_backtests_to_directory, BacktestMetrics, DATA_DIRECTORY, \
     retag_backtests
 from .infrastructure import AzureBlobStorageStateHandler, \
-    CSVOHLCVDataProvider, CSVTickerDataProvider, \
+    CSVOHLCVDataProvider, CSVTickerDataProvider, CSVURLDataProvider, \
+    JSONURLDataProvider, ParquetURLDataProvider, \
     CCXTOHLCVDataProvider, CCXTTickerDataProvider, \
     PandasOHLCVDataProvider, OHLCVDataProviderBase, \
     YahooOHLCVDataProvider, \
@@ -116,7 +117,8 @@
     'DataType',
     'CSVOHLCVDataProvider',
     'CSVTickerDataProvider',
-    "CCXTOHLCVDataProvider",
+    'CSVURLDataProvider', 'JSONURLDataProvider',
+    'ParquetURLDataProvider', "CCXTOHLCVDataProvider",
     "CCXTTickerDataProvider",
     "OHLCVDataProviderBase",
     "YahooOHLCVDataProvider",
```
