Stock Predictor Project is a comprehensive machine learning framework for predicting stock returns and evaluating trading strategies. This project combines state-of-the-art machine learning techniques with traditional financial analysis to create a powerful tool for investment decision-making.
Financial markets generate vast amounts of data, but extracting actionable insights from this data remains challenging. This project addresses this challenge by:
- Applying machine learning to financial prediction: Using various ML algorithms to identify patterns in stock returns that traditional approaches might miss
- Creating a systematic evaluation framework: Moving beyond simple backtest results to apply rigorous statistical tests of performance
- Making financial research reproducible: Providing a complete pipeline from data acquisition to model evaluation
- State-of-the-art input data: CRSP for stock returns, OpenAssetPricing for predictive signals
- Prediction models:
- Included here: Ridge, Lasso, RidgeCV, LassoCV, and Histogram-based Gradient Boosting
- Can easily be expanded to any other model that can be integrated into
sklearnpipelines
- Out-of-sample validation: Rolling training and testing to prevent look-ahead bias
- Portfolio construction: Creation of long-short portfolios based on model predictions
- Performance analysis: Sharpe ratios, factor regression to estimate alpha, and portfolio turnover metrics
- Interactive dashboard: Streamlit web app for comparing performance across many models and understanding each
StockPredictorProject/
├── .gitignore # Git ignore file
├── .streamlit/ # Streamlit configuration
│ └── config.toml # Streamlit theme configuration
├── input_data/ # Input data (not tracked by git)
│ └── crsp_data.csv # CRSP return data
├── models/ # Saved ML model files
├── output_portfolios/ # Generated portfolio returns and statistics
│ ├── port_stats_tall.csv # Portfolio statistics in tall format
│ └── returns_wide.csv # Portfolio returns in wide format
├── predictions/ # Model predictions
│ └── prediction_output.csv # Out-of-sample predictions
├── README.md # This file
├── stock_prediction_and_eval.ipynb # Main Jupyter notebook for analysis
├── streamlit_app.py # Streamlit dashboard for single model exploration
└── streamlit_app2.py # Enhanced multi-page Streamlit dashboard
This project requires the following Python packages:
- Core Data Science: pandas, numpy, matplotlib, seaborn
- Machine Learning: scikit-learn, statsmodels
- Finance: pandas-datareader, openassetpricing
- Visualization: matplotlib, seaborn
- Dashboard: streamlit
- Utils: tqdm, gdown, ydata-profiling
-
Clone the repository:
git clone https://github.com/yourusername/StockPredictorProject_PROF.git cd StockPredictorProject_PROF -
Create and activate a virtual environment (recommended):
python -m venv venv # On Windows venv\Scripts\activate # On macOS/Linux source venv/bin/activate
-
Install dependencies:
pip install -r requirements.txt
-
Download the data:
- If you don't have the necessary data files, the code will attempt to download them from Google Drive when executed.
- Exception: CRSP stock return data requires a WRDS login and I will not distribute it.
To run the full analysis pipeline, open and run the Jupyter notebook (stock_prediction_and_eval.ipynb). This will:
- Load and preprocess the data
- Train various machine learning models
- Generate out-of-sample predictions
- Create and evaluate portfolios
- Save results to the output directories
To launch the interactive dashboard:
streamlit run streamlit_app.pyThe dashboard allows you to:
- Explore model performance through interactive visualizations
- Compare different models side-by-side
- View portfolio returns, factor exposures, and other statistics
- Analyze the performance across different time periods
-
Data Preparation:
- Load stock return data and asset pricing signals
- Handle missing values and outliers with cross-sectional imputation
- Standardize features for reliable model training
-
Model Training:
- Implement rolling window approach to simulate real-world prediction
- Train models on historical data up to year t-1
- Generate predictions for year t
- Repeat for each year in the sample period
-
Portfolio Construction:
- Sort stocks into quintiles based on model predictions
- Form zero-cost long-short portfolios (long top quintile, short bottom quintile)
- Calculate portfolio returns and performance metrics
-
Performance Evaluation:
- Factor regression analysis to assess risk-adjusted returns
- Portfolio turnover analysis to evaluate trading costs
- Sharpe ratio calculation for risk-adjusted performance comparison
- Special thanks to all contributors to the Open Asset Pricing project
- Financial data provided by CRSP (Center for Research in Security Prices)
- Factor data provided by Kenneth French's data library