
Commit 064e67c

Merge pull request freqtrade#13227 from yongzhe2160cs/feature/backtest-profit-pvalue
Add mean-trade-return p-value to backtest summary metrics
2 parents 1c4d33c + f77facc commit 064e67c

9 files changed

Lines changed: 99 additions & 1 deletion

docs/backtesting.md

Lines changed: 8 additions & 0 deletions
@@ -229,6 +229,7 @@ A backtesting result will look like that:
 │ Sortino (closed trades) │ 2.57 │
 │ Calmar (closed trades) │ 43.03 │
 │ SQN │ 0.71 │
+│ Mean profit p-value │ 0.4768 │
 │ Profit factor │ 1.30 │
 │ Expectancy (Ratio) │ 0.74 (0.04) │
 │ Avg. daily profit │ 1.844 USDT │
@@ -362,6 +363,7 @@ It contains key metrics about the performance of your strategy on backtesting da
 │ Sortino (closed trades) │ 2.57 │
 │ Calmar (closed trades) │ 43.03 │
 │ SQN │ 0.71 │
+│ Mean profit p-value │ 0.4768 │
 │ Profit factor │ 1.30 │
 │ Expectancy (Ratio) │ 0.74 (0.04) │
 │ Avg. daily profit │ 1.844 USDT │
@@ -424,6 +426,7 @@ It contains key metrics about the performance of your strategy on backtesting da
 - `Sortino (closed trades)`: Annualized Sortino ratio including only closed trades (ignoring open trades with profits or losses).
 - `Calmar (closed trades)`: Annualized Calmar ratio including only closed trades (ignoring open trades with profits or losses).
 - `SQN`: System Quality Number (SQN) - by Van Tharp.
+- `Mean profit p-value`: Two-sided p-value of a one-sample Student's t-test against the null hypothesis that the mean per-trade return is zero - in short, "is the average profit distinguishable from noise?". A small value (the usual bar is below `0.05`) means the observed edge is unlikely to be down to chance. Its underlying t-statistic is identical to `SQN`. See the note below for how to read it in practice.
 - `Profit factor`: Sum of the profits of all winning trades divided by the sum of the losses of all losing trades.
 - `Expectancy (Ratio)`: Expectancy ratio, which is the average profit or loss per trade. A negative expectancy ratio means that your strategy is not profitable.
 - `Avg. daily profit`: Average profit per day, calculated as `(Total Profit / Backtest Days)`.
@@ -455,6 +458,11 @@ It contains key metrics about the performance of your strategy on backtesting da
 - `Sortino (wallet balance)` Annualized Sortino ratio calculation including unrealized profits.
 - `Calmar (wallet balance)` Annualized Calmar ratio calculation including unrealized profits.

+??? Note "Reading the mean profit p-value"
+    Think of the p-value as the answer to one question: *if the strategy truly had no edge, how often would pure chance still hand you an average per-trade result at least this far from zero?* A value of `0.4768` therefore means roughly a 48% chance of a swing this large turning up from randomness alone - in other words the average profit is not distinguishable from luck. The lower the p-value, the less likely the result is a fluke, and a common rule of thumb is to treat anything below `0.05` (a 5% chance) as "statistically significant".
+
+    Two things keep this honest. The test assumes trades are independent and identically distributed, which real strategies rarely are (trades overlap and cluster in time), so the figure is an *optimistic* lower bound - the true uncertainty is usually larger. And because backtesting and hyperopt evaluate many strategies, some will score a low p-value by chance alone, so a small value only tells you a result is hard to explain by noise; it is not by itself proof of a genuine edge.
+
 !!! Tip "Wallet based Metrics"
     The metrics under the "Wallet based Metrics" section are calculated based on the unrealized balance, which includes the capital tied in open trades. This provides a more comprehensive view of the strategy's performance, as it accounts for both realized and unrealized profits and losses.
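Review note on the multiple-testing caveat in the docs text above: a minimal simulation sketch, not part of the commit. It assumes i.i.d. normally distributed per-trade returns with a true mean of zero; the function name `false_positive_rate` and all of its parameters are hypothetical. With no edge at all, roughly a fraction `alpha` of strategies should still come in below the threshold.

```python
import numpy as np
from scipy import stats


def false_positive_rate(
    n_strategies: int = 2000,
    n_trades: int = 50,
    alpha: float = 0.05,
    seed: int = 0,
) -> float:
    """Fraction of zero-edge strategies whose p-value falls below alpha."""
    rng = np.random.default_rng(seed)
    # Each row is one hypothetical strategy: i.i.d. per-trade returns
    # drawn with a true mean of exactly zero (assumption of this sketch).
    returns = rng.normal(loc=0.0, scale=0.01, size=(n_strategies, n_trades))
    # One-sample t-test per row (axis=1) against a population mean of zero.
    p_values = stats.ttest_1samp(returns, popmean=0.0, axis=1).pvalue
    return float(np.mean(p_values < alpha))


if __name__ == "__main__":
    rate = false_positive_rate()
    print(f"share of no-edge strategies with p < 0.05: {rate:.3f}")
```

The same reasoning applies to a hyperopt run: the more candidates are evaluated, the more of them clear the `0.05` bar by luck, which is why a single low p-value is weak evidence on its own.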

freqtrade/data/metrics.py

Lines changed: 22 additions & 0 deletions
@@ -5,6 +5,7 @@

 import numpy as np
 import pandas as pd
+from scipy import stats


 logger = logging.getLogger(__name__)
@@ -632,3 +633,24 @@ def calculate_sqn(trades: pd.DataFrame, starting_balance: float) -> float:
         sqn = -100.0

     return round(sqn, 4)
+
+
+def calculate_p_value(trades: pd.DataFrame, starting_balance: float) -> float:
+    """
+    Two-sided p-value for the null hypothesis that mean per-trade profit
+    (profit_abs / starting_balance) equals zero.
+    Returns 1.0 for fewer than 2 trades or zero-variance samples.
+
+    :param trades: DataFrame containing trades (requires column profit_abs)
+    :param starting_balance: Starting balance of the trading system
+    :return: Two-sided p-value in the range [0, 1]. Returns 1.0 (no evidence
+        against the null) when it cannot be computed - fewer than two
+        trades or zero return variance.
+    """
+    if len(trades) < 2:
+        return 1.0
+    returns = trades["profit_abs"] / starting_balance
+    if returns.std() == 0:
+        return 1.0
+    _, p_value = stats.ttest_1samp(returns, popmean=0)
+    return float(p_value)
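Review note: the docs change states that the t-statistic behind this p-value is identical to `SQN`. The sketch below spells that out by hand, assuming SQN is computed as `sqrt(n) * mean / std` with the sample standard deviation (the body of `calculate_sqn` is not shown in this diff, so that formula is an assumption). The helper name `t_stat_and_p_value` is hypothetical, and it expects at least two trades with non-zero variance.

```python
import math

import pandas as pd
from scipy import stats


def t_stat_and_p_value(profits: list[float], starting_balance: float) -> tuple[float, float]:
    """Hand-rolled one-sample t-test of per-trade returns against a mean of zero."""
    returns = pd.Series(profits) / starting_balance
    n = len(returns)
    # pandas' std() defaults to ddof=1 (sample standard deviation),
    # which is what the one-sample t-test uses.
    t_stat = math.sqrt(n) * returns.mean() / returns.std()
    # Two-sided tail probability of Student's t with n - 1 degrees of freedom.
    p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 1)
    return float(t_stat), float(p_value)


if __name__ == "__main__":
    profits = [1.0, -0.5, 2.0, -1.0, 0.5, 1.5, -0.5, 1.0]
    t_stat, p_value = t_stat_and_p_value(profits, starting_balance=100.0)
    reference = stats.ttest_1samp(pd.Series(profits) / 100.0, popmean=0)
    print(t_stat, p_value, reference.statistic, reference.pvalue)
```

Under that assumption the two metrics carry the same information: SQN reports the test statistic, and the p-value maps it onto a probability scale that also accounts for the number of trades.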

freqtrade/optimize/optimize_reports/bt_output.py

Lines changed: 4 additions & 0 deletions
@@ -405,6 +405,10 @@ def text_table_add_metrics(strat_results: dict) -> None:
                 f"{strat_results['calmar']:.2f}" if "calmar" in strat_results else "N/A",
             ),
             ("SQN", f"{strat_results['sqn']:.2f}" if "sqn" in strat_results else "N/A"),
+            (
+                "Mean profit p-value",
+                (f"{strat_results['p_value']:.4g}" if "p_value" in strat_results else "N/A"),
+            ),
             (
                 "Profit factor",
                 (
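Review note: the new row uses the `.4g` format spec (four significant digits) where neighbouring rows use `.2f`. A small sketch of how that renders for a few values; the helper `format_p_value` is a hypothetical name that only mirrors the expression in the hunk above.

```python
def format_p_value(strat_results: dict) -> str:
    """Mirror the table cell: four significant digits, or "N/A" if the key is missing."""
    return f"{strat_results['p_value']:.4g}" if "p_value" in strat_results else "N/A"


if __name__ == "__main__":
    for value in (0.4768123, 0.0321, 1.0, 0.0001234, 0.00001234):
        print(value, "->", format_p_value({"p_value": value}))
    print("missing ->", format_p_value({}))
```

With `g`, very small p-values switch to scientific notation rather than being rounded to `0.0000`, and the fallback value `1.0` prints as `1`.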

freqtrade/optimize/optimize_reports/optimize_reports.py

Lines changed: 3 additions & 0 deletions
@@ -16,6 +16,7 @@
     calculate_market_change,
     calculate_max_drawdown,
     calculate_max_drawdown_from_balance,
+    calculate_p_value,
     calculate_sharpe,
     calculate_sharpe_from_balance,
     calculate_sortino,
@@ -224,6 +225,7 @@ def _generate_result_line(
         "sharpe": calculate_sharpe(result, min_date, max_date, starting_balance),
         "calmar": calculate_calmar(result, min_date, max_date, starting_balance),
         "sqn": calculate_sqn(result, starting_balance),
+        "p_value": calculate_p_value(result, starting_balance),
         "profit_factor": profit_factor,
         "max_drawdown_account": drawdown.relative_account_drawdown if drawdown else 0.0,
         "max_drawdown_abs": drawdown.drawdown_abs if drawdown else 0.0,
@@ -684,6 +686,7 @@ def generate_strategy_stats(
         "sharpe": calculate_sharpe(results, min_date, max_date, start_balance),
         "calmar": calculate_calmar(results, min_date, max_date, start_balance),
         "sqn": calculate_sqn(results, start_balance),
+        "p_value": calculate_p_value(results, start_balance),
         "wallet_stats": generate_wallet_stats(content.get("wallet_summary"), stake_currency),
         "profit_factor": profit_factor,
         "backtest_start": min_date.strftime(DATETIME_PRINT_FORMAT),

requirements-hyperopt.txt

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@
 -r requirements.txt

 # Required for hyperopt
-scipy==1.17.1
 scikit-learn==1.9.0
 filelock==3.29.1
 optuna==4.9.0

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ numpy==2.4.6
 pandas==3.0.3
 bottleneck==1.6.0
 numexpr==2.14.1
+scipy==1.17.1
 # Indicator libraries
 ft-pandas-ta==0.3.16
 ta-lib==0.6.8

tests/conftest.py

Lines changed: 3 additions & 0 deletions
@@ -533,6 +533,9 @@ def patch_torch_initlogs(mocker) -> None:

         module_name = "torch"
         mocked_module = types.ModuleType(module_name)
+        # SciPy's array-API dispatch probes ``torch.Tensor`` to classify inputs;
+        # expose a dummy so scipy.stats stays importable/usable under the mock.
+        mocked_module.Tensor = type("Tensor", (), {})
         sys.modules[module_name] = mocked_module
     else:
         try:
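Review note: a minimal sketch of why the bare mocked module needs a `Tensor` attribute. The check `is_torch_tensor` below is a hypothetical stand-in for any library-side `isinstance(obj, torch.Tensor)` probe; it is not SciPy's actual code, and the other helper names are likewise made up for illustration.

```python
import sys
import types
from unittest import mock


def make_fake_torch() -> types.ModuleType:
    """Build a stub 'torch' module that exposes a dummy Tensor class."""
    fake = types.ModuleType("torch")
    fake.Tensor = type("Tensor", (), {})
    return fake


def is_torch_tensor(obj: object) -> bool:
    """Hypothetical probe: classify obj against whatever 'torch' is registered."""
    torch = sys.modules.get("torch")
    return torch is not None and isinstance(obj, torch.Tensor)


def probe_with(fake_torch: types.ModuleType, obj: object) -> bool:
    """Run the probe with fake_torch temporarily registered in sys.modules."""
    # patch.dict restores sys.modules when the block exits.
    with mock.patch.dict(sys.modules, {"torch": fake_torch}):
        return is_torch_tensor(obj)


if __name__ == "__main__":
    # With the dummy class the probe answers False instead of failing.
    print(probe_with(make_fake_torch(), [1.0, 2.0]))
    try:
        # A bare module has no Tensor attribute, so the probe raises.
        probe_with(types.ModuleType("torch"), [1.0, 2.0])
    except AttributeError as exc:
        print("bare mock fails:", exc)
```

An empty class is enough here because no real object is ever an instance of it, so every input is classified as "not a torch tensor".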

tests/data/test_metrics.py

Lines changed: 51 additions & 0 deletions
@@ -18,6 +18,7 @@
     calculate_market_change,
     calculate_max_drawdown,
     calculate_max_drawdown_from_balance,
+    calculate_p_value,
     calculate_sharpe,
     calculate_sharpe_from_balance,
     calculate_sortino,
@@ -442,6 +443,56 @@ def test_calculate_sqn_cases(profits, starting_balance, expected_sqn, descriptio
     assert pytest.approx(sqn, rel=1e-4) == expected_sqn


+def test_calculate_p_value_edge_cases():
+    # Fewer than two trades -> not computable, returns "no evidence" default.
+    assert calculate_p_value(DataFrame({"profit_abs": []}), 100) == 1.0
+    assert calculate_p_value(DataFrame({"profit_abs": [1.0]}), 100) == 1.0
+
+    # Zero variance (all identical returns) -> not computable.
+    assert calculate_p_value(DataFrame({"profit_abs": [1.0, 1.0, 1.0]}), 100) == 1.0
+
+    # p-value is always within [0, 1].
+    p_value = calculate_p_value(DataFrame({"profit_abs": [1.0, -0.5, 2.0, -1.0]}), 100)
+    assert 0.0 <= p_value <= 1.0
+
+
+def test_calculate_p_value_scale_invariance():
+    # The t-statistic, and hence the p-value, is invariant to the stake scale.
+    profits = [1.0, -0.5, 2.0, -1.0, 0.5, 1.5, -0.5, 1.0]
+    trades = DataFrame({"profit_abs": profits})
+    p_small = calculate_p_value(trades, starting_balance=10)
+    p_large = calculate_p_value(trades, starting_balance=100_000)
+    assert pytest.approx(p_small, rel=1e-9) == p_large
+
+
+def test_calculate_p_value_matches_reference():
+    """
+    calculate_p_value must match scipy.stats.ttest_1samp, the canonical
+    reference, computed live for each case.
+    """
+    from scipy import stats
+
+    cases = [
+        [0.01, -0.005, 0.02, 0.015, -0.01],
+        [0.05, 0.04, 0.06, 0.045, 0.055],
+        [-0.01, -0.02, -0.015, -0.005, -0.025],
+        [0.001, -0.001, 0.001, -0.001],
+    ]
+    starting_balance = 1000.0
+    for returns in cases:
+        trades = DataFrame({"profit_abs": [r * starting_balance for r in returns]})
+        result = calculate_p_value(trades, starting_balance)
+        _, expected = stats.ttest_1samp(returns, popmean=0)
+        assert abs(result - float(expected)) < 1e-10
+
+
+def test_calculate_p_value_zero_mean():
+    # A strategy whose average trade is exactly break-even has a t-statistic of
+    # zero -> p-value of exactly 1.0 (entirely indistinguishable from noise).
+    trades = DataFrame({"profit_abs": [1.0, -1.0, 2.0, -2.0]})
+    assert calculate_p_value(trades, starting_balance=100) == 1.0
+
+
 @pytest.mark.parametrize(
     "start,end,days, expected",
     [
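Review note: the scale-invariance test above follows directly from the form of the t-statistic. A short derivation, where the symbols $r_i$ (per-trade return), $c > 0$ (a rescaling such as a different `starting_balance`), $\bar{r}$ and $s$ are notation introduced here and not taken from the commit:

```latex
t = \frac{\bar{r}}{s / \sqrt{n}}, \qquad
\bar{r} = \frac{1}{n} \sum_{i=1}^{n} r_i, \qquad
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (r_i - \bar{r})^2}

r_i' = \frac{r_i}{c}
\;\Longrightarrow\;
\bar{r}' = \frac{\bar{r}}{c}, \quad s' = \frac{s}{c}
\;\Longrightarrow\;
t' = \frac{\bar{r} / c}{(s / c) / \sqrt{n}} = t
```

Since the p-value depends only on $t$ and the degrees of freedom $n - 1$, dividing every profit by a different positive balance leaves it unchanged, up to floating-point rounding (hence the relative tolerance in the test).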

tests/optimize/test_optimize_reports.py

Lines changed: 7 additions & 0 deletions
@@ -232,6 +232,9 @@ def test_generate_backtest_stats(default_conf, testdatadir, tmp_path):
     assert strat_stats["drawdown_end_ts"] == 1510699380000
     assert strat_stats["drawdown_start_ts"] == 1510697400000
     assert strat_stats["pairlist"] == ["UNITTEST/BTC"]
+    # Statistical significance of the mean trade return
+    assert "p_value" in strat_stats
+    assert strat_stats["p_value"] == pytest.approx(0.8957701627)

     # Test storing stats
     filename = tmp_path / "btresult.json"
@@ -666,13 +669,17 @@ def test_text_table_add_metrics_shows_wallet_ratios(testdatadir, capsys):
         "max_drawdown_low": 0.95,
     }

+    strat_results["p_value"] = 0.0321
+
     text_table_add_metrics(strat_results)
     text = capsys.readouterr().out

     assert "Sharpe (daily wallet balance)" in text
     assert "Sortino (daily wallet balance)" in text
     assert "Calmar (daily wallet balance)" in text
     assert "Max % of account underwater (balance)" in text
+    assert "Mean profit p-value" in text
+    assert "0.0321" in text


 def test_generate_periodic_breakdown_stats(testdatadir):
