
feat(sp500-strategies): fix small grammar mistakes and improve clarity of subject

Branch pull/2504/head · commit a0b925dece · nprimo, committed by Niccolò Primo, 2 months ago

Changed file: subjects/ai/sp500-strategies/README.md
## Financial strategies on the SP500

In this project we will apply machine learning to finance. You are a Quant/Data Scientist and your goal is to create a financial strategy based on a signal outputted by a machine learning model that outperforms the [SP500](https://en.wikipedia.org/wiki/S%26P_500).
The Standard & Poor's 500 Index is a collection of stocks intended to reflect the overall return characteristics of the stock market as a whole. The stocks that make up the S&P 500 are selected by market capitalization, liquidity, and industry. Companies to be included in the S&P are selected by the S&P 500 Index Committee, which consists of a group of analysts employed by Standard & Poor's.

The S&P 500 Index originally began in 1926 as the "composite index", composed of only 90 stocks. According to historical records, the average annual return from its inception in 1926 through 2018 is approximately 10%–11%. The average annual return since adopting 500 stocks into the index in 1957 through 2018 is roughly 8%.
As a Quant Researcher, you may beat the SP500 one year or a few years. The real challenge, though, is to beat the SP500 consistently over decades. That's what most hedge funds in the world are trying to do.
The project is divided into three parts:
- **Data processing and feature engineering**: Build a dataset: insightful features and the target
- **Machine Learning pipeline**: Train machine learning models on the dataset, select the best model and generate the machine learning signal.
- **Strategy backtesting**: Generate a strategy from the Machine Learning model output and backtest the strategy. As a reminder, the idea here is to see how the strategy would have performed if you had invested.

### Deliverables

Do not forget to check the resources of W1D5 and especially W1D5E4.
### Data processing and feature engineering
The file `HistoricalData.csv` contains the open-high-low-close (OHLC) SP500 index data and the other file, `all_stocks_5yr.csv`, contains the open-high-low-close-volume (OHLCV) data on the SP500 constituents.
- Split the data in train and test. The test set should start from **2017**.
- Your first priority is to build a dataset without leakage.
Note: Financial data can be complex and tricky to analyse for a lot of reasons. In order to focus on Time Series forecasting, the project gives access to a "simplified" financial dataset. For instance, we consider that the composition of the SP500 remains similar over time, which is not true and which introduces a "survivor bias". Plus, the data during COVID-19 was removed because it may have a significant impact on the backtesting.
**"No leakage" [intro](<https://en.wikipedia.org/wiki/Leakage_(machine_learning)>) and small guide:**

We assume it is day `D`, and we want to take a position for the next `n` days. The position starts on day D+1 (included). To decide whether we take a short or long position, the return between day D+1 and D+2 is computed and used as a target. Finally, as features on day D contain information until day D 11:59 pm, the target needs to be shifted. As a result, the final DataFrame schema is:
| Index | Features | Target |
| ------- | :------------------------: | ---------------: |
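The target shift described above can be sketched in pandas. This is a minimal sketch: the column name `Close` and the toy prices are illustrative, not the project's actual data.

```python
import pandas as pd

# Toy close prices for a single asset; dates and values are illustrative.
prices = pd.DataFrame(
    {"Close": [100.0, 102.0, 101.0, 105.0, 104.0]},
    index=pd.date_range("2015-01-05", periods=5, freq="B"),
)

# Return between day D+1 and D+2, stored at index D:
# pct_change() at row t is the return t-1 -> t, so shift(-2)
# aligns the D+1 -> D+2 return on row D.
future_return = prices["Close"].pct_change().shift(-2)

# Binary target: 1 if the price increases between D+1 and D+2.
df = prices.assign(target=(future_return > 0).astype(int))
```

Because the value stored on row `D` only describes days strictly after `D`, features built from data up to `D` cannot contain the answer.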
- Time Series split (plot below)
- Make sure the last fold of the train set does not overlap with the test set.
- Make sure the folds do not contain data from the same day. The data should be split on the dates.
- Plot your cross validation as follows:
![alt text][blocking]
[blocking]: blocking_time_series_split.png 'Blocking Time Series split'
![alt text][timeseries]
[timeseries]: Time_series_split.png 'Time Series split'
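A blocking time series split like the one pictured can be sketched as follows. This is a simplified illustration, not the full scikit-learn splitter API; `train_ratio` is an assumed parameter.

```python
import numpy as np

class BlockingTimeSeriesSplit:
    """Cut the sample into contiguous, non-overlapping blocks; inside each
    block the first part is the train fold and the rest the validation fold,
    so no two folds share dates."""

    def __init__(self, n_splits=5, train_ratio=0.8):
        self.n_splits = n_splits
        self.train_ratio = train_ratio

    def split(self, X):
        n = len(X)
        block = n // self.n_splits
        for i in range(self.n_splits):
            start, stop = i * block, (i + 1) * block
            cut = start + int(self.train_ratio * (stop - start))
            yield np.arange(start, cut), np.arange(cut, stop)

splits = list(BlockingTimeSeriesSplit(n_splits=3).split(range(30)))
```

Unlike the standard time series split, each fold here starts fresh: train and validation windows of different folds never overlap.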
Once you have run the grid search on the cross validation (choose either Blocking or Time Series split), select the best pipeline on the train set and save it as `selected_model.pkl` and `selected_model.txt` (pipeline hyperparameters).
**Note: You may observe that the selected model is not good after analyzing the ML metrics (on the train set) and select another one.**
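The grid search and model saving step might look like the sketch below. The pipeline steps, hyper-parameter grid, and toy data are placeholder assumptions, not the required setup.

```python
import pickle
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the engineered features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (rng.random(200) > 0.5).astype(int)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", GradientBoostingClassifier(random_state=0))])
grid = GridSearchCV(
    pipe,
    {"model__n_estimators": [10, 20]},   # toy grid
    cv=TimeSeriesSplit(n_splits=3),      # or a blocking split
    scoring="roc_auc",
)
grid.fit(X, y)

# Save the selected pipeline and its hyper-parameters.
with open("selected_model.pkl", "wb") as f:
    pickle.dump(grid.best_estimator_, f)
with open("selected_model.txt", "w") as f:
    f.write(str(grid.best_params_))
```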
- ML metrics and feature importance on the selected pipeline on the train set only.
- DataFrame with the ML metrics on train and validation sets on all folds of the train set. Suggested format: columns: ML metrics (AUC, Accuracy, LogLoss), rows: folds, train set and validation set (double index). Save it as `ml_metrics_train.csv`.
- Plot. Choose the metric you want; suggested: AUC. Save it as `metric_train.png`. The plot below shows how it should look.
- DataFrame with top 10 important features for each fold. Save it as `top_10_feature_importance.csv`
![alt text][barplot]
[barplot]: metric_plot.png 'Metric plot'
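The per-fold metrics table can be sketched as below. The data is simulated, and `LogisticRegression` and `TimeSeriesSplit` are stand-ins for your selected pipeline and chosen cross validation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(size=300) > 0).astype(int)  # toy target

rows = {}
for fold, (tr, va) in enumerate(TimeSeriesSplit(n_splits=4).split(X)):
    model = LogisticRegression().fit(X[tr], y[tr])
    for name, idx in [("train", tr), ("validation", va)]:
        proba = model.predict_proba(X[idx])[:, 1]
        rows[(fold, name)] = {
            "AUC": roc_auc_score(y[idx], proba),
            "Accuracy": accuracy_score(y[idx], (proba > 0.5).astype(int)),
            "LogLoss": log_loss(y[idx], proba),
        }

metrics = pd.DataFrame(rows).T          # double index: (fold, set)
metrics.to_csv("ml_metrics_train.csv")
```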
- The signal has to be generated with the chosen cross validation: train the model on the train set of the first fold, then predict on its validation set; train the model on the train set of the second fold, then predict on its validation set, etc. Then, concatenate the predictions on the validation sets to build the machine learning signal. **The pipeline shouldn't be trained once and predict on all data points!**
**The output is a DataFrame or Series with a double index, ordered, containing the probability that the stock price of asset `i` increases between d+1 and d+2.**
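The fold-wise signal generation can be sketched like this (toy data; the model and splitter are placeholders). Note that the model is refit on each fold and only the validation predictions are kept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(250, 4))
y = (X[:, 1] + rng.normal(size=250) > 0).astype(int)  # toy target

parts = []
for tr, va in TimeSeriesSplit(n_splits=5).split(X):
    model = LogisticRegression().fit(X[tr], y[tr])   # refit per fold
    parts.append(pd.Series(model.predict_proba(X[va])[:, 1], index=va))

# Concatenate the validation predictions, in time order, into one signal.
ml_signal = pd.concat(parts)
```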
- (optional): [Train an RNN/LSTM](https://towardsdatascience.com/predicting-stock-price-with-lstm-13af86a74944). This is a nice way to discover and learn about recurrent neural networks. But keep in mind that there are some new neural network architectures that seem to outperform recurrent neural networks. Here is an [interesting article](https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0) about the topic.
### Strategy backtesting
- Backtesting module deliverables. The module takes a machine learning signal as input and converts it into a financial strategy. A financial strategy DataFrame gives the amount invested at time `t` on asset `i`. The module returns the following metrics on the train set and the test set.
- Profit and Loss (PnL) plot: save it as `strategy.png`
- x axis: date
- y axis1: PnL of the strategy at time `t`
- y axis2: PnL of the SP500 at time `t`
- Use the same scale for y axis1 and y axis2
- Add a line that shows the separation between the train set and the test set
- PnL
- [Max drawdown](https://www.investopedia.com/terms/d/drawdown.asp)
- (Optional): add other metrics such as Sharpe ratio, volatility, etc.
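PnL and max drawdown can be computed as in this sketch. The daily PnL series is simulated here; in the project it comes from your backtest.

```python
import numpy as np
import pandas as pd

# Simulated daily strategy PnL ($1 invested per day -> additive PnL).
rng = np.random.default_rng(3)
daily_pnl = pd.Series(rng.normal(0.0005, 0.01, size=500))

cum_pnl = daily_pnl.cumsum()            # cumulative PnL curve to plot
running_max = cum_pnl.cummax()
drawdown = cum_pnl - running_max        # distance below the running peak
max_drawdown = drawdown.min()           # most negative value
```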
- Create a markdown report that explains the following and save it as `report.md`:
- the features used
- the pipeline used
- `Imputer`
- `Scaler`
- dimension reduction
- model
- the cross-validation used
- Long only:
- Binary signal:
- 0: do nothing for one day on asset `i`
- 1: take a long position on asset `i` for 1 day
- Weights proportional to the machine learning signals
- invest x on asset `i` for one day
- Long and short: For those who search for long-short strategies on Google, don't get it wrong, this has nothing to do with pair trading.
- Binary signal:
- -1: take a short position on asset `i` for 1 day
- 1: take a long position on asset `i` for 1 day
- Ternary signal:
- -1: take a short position on asset `i` for 1 day
- 0: do nothing for one day on asset `i`
- 1: take a long position on asset `i` for 1 day
Notes:
- Warning! When you don't invest in all stocks, as in the binary signal or the ternary signal, make sure that you are still investing $1 per day!
- In order to simplify the **short position**, we consider that it is the opposite of a long position. Example: I short one AAPL stock and the price decreases by $20 in one day. I earn $20.
- Stock picking: Take a long position on the `k` best assets (from the machine learning signal) and short the `k` worst assets regarding the machine learning signal.
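The stock-picking variant can be sketched as follows. The tickers and signal values are made up; `k` is the number of assets picked on each side.

```python
import pandas as pd

# Toy ML signal for one day: probability that each asset's price increases.
signal = pd.Series({"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.3, "E": 0.1})

k = 2
ranked = signal.sort_values(ascending=False)
weights = pd.Series(0.0, index=signal.index)
weights[ranked.index[:k]] = 1.0     # long the k best assets
weights[ranked.index[-k:]] = -1.0   # short the k worst assets

# Normalize so that the total absolute amount invested is $1 for the day.
weights /= weights.abs().sum()
```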
Here's an example of how to convert a machine learning signal into a financial strategy:
- Multiply it by the associated return.
Don't forget the meaning of the signal on day d: it gives the return between d+1 and d+2. You should multiply the binary signal of day d by the return computed between d+1 and d+2. Otherwise, it's wrong, because you would be using a signal that gives you information on d+1 and d+2 on the past or present. The strategy is leaked!
**Assumption**: you have $1 per day to invest in your strategy.
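The multiplication rule above (signal of day d times the return between d+1 and d+2, with $1 invested per day) can be sketched as follows; the prices and positions are toy values.

```python
import pandas as pd

idx = pd.date_range("2017-01-02", periods=6, freq="B")
close = pd.Series([10.0, 10.5, 10.2, 10.8, 11.0, 10.9], index=idx)  # toy prices
position = pd.Series([1, -1, 1, 1, -1, 1], index=idx)               # toy +/-1 signal

# Return between d+1 and d+2, aligned on day d (same shift as the target).
fwd_return = close.pct_change().shift(-2)

# $1 invested per day: PnL of day d is position(d) * return(d+1 -> d+2).
daily_pnl = (position * fwd_return).dropna()
```

Using any other alignment would multiply the position by a return the signal already knows about, which is exactly the leakage warned against above.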
### Project repository structure:
```
project
├── data
│   └── sp500.csv
├── environment.yml
├── README.md
├── results
│   ├── cross-validation
│   │   ├── metric_train.csv
│   │   ├── metric_train.png
│   │   ├── ml_metrics_train.csv
│   │   └── top_10_feature_importance.csv
│   ├── selected-model
│   │   ├── ml_signal.csv
│   │   ├── selected_model.pkl
│   │   └── selected_model.txt
│   └── strategy
│       ├── report.md
│       ├── results.csv
│       └── strategy.png
└── scripts
    ├── create_signal.py
    ├── features_engineering.py
    ├── gridsearch.py
    ├── model_selection.py
    └── strategy.py
```
Note: `features_engineering.py` can be used in `gridsearch.py`
### Files for this project
You can find the data required for this project in this [link](https://assets.01-edu.org/ai-branch/project4/project04-20221031T173034Z-001.zip).
