
fix: adapt audit project 4 and fix p1,2 and 3

pull/42/head
Badr Ghazlane 2 years ago
parent
commit
186dae52d5
  1. one_md_per_day_format/projects/project1/readme.md (2 lines changed)
  2. one_md_per_day_format/projects/project3/readme.md (5 lines changed)
  3. one_md_per_day_format/projects/project4/Time_series_split.png (BIN)
  4. one_md_per_day_format/projects/project4/audit/readme.md (133 lines added)
  5. one_md_per_day_format/projects/project4/blocking_time_series_split.png (BIN)
  6. one_md_per_day_format/projects/project4/metric_plot.png (BIN)
  7. one_md_per_day_format/projects/project4/readme.md (214 lines added)

one_md_per_day_format/projects/project1/readme.md (2 lines changed)

@@ -1,4 +1,4 @@
-# First Kaggle
+# Your first Kaggle: Titanic
## Introduction

one_md_per_day_format/projects/project3/readme.md (5 lines changed)

@@ -1,5 +1,4 @@
-# Computer vision
+# Emotions detection with Deep Learning
Cameras are everywhere. Videos and images have become one of the most interesting data sets for artificial intelligence.
Image processing is quite a broad research area, not just filtering, compression, and enhancement. Besides, we are even interested in the question, “what is in images?”, i.e., content analysis of visual inputs, which is part of the main task of computer vision. The study of computer vision could make possible such tasks as 3D reconstruction of scenes, motion capturing, and object recognition, which are crucial for even higher-level intelligence such as
image and video understanding, and motion understanding.
@@ -33,7 +32,7 @@ The two steps are detailed below.
- Participate in this challenge: https://www.kaggle.com/c/digit-recognizer/code . The MNIST dataset is a reference in computer vision. Researchers use it as a benchmark to compare their models. Start first with a logistic regression to understand how to handle images in Python. And then train your first CNN on this data set.
-## Face emotions classfication
+## Face emotions classification
Emotion detection is one of the most researched topics in the modern-day machine learning arena. The ability to accurately detect and identify an emotion opens up numerous doors for Advanced Human Computer Interaction. The aim of this project is to detect up to seven distinct facial emotions in real time. This project runs on top of a Convolutional Neural Network (CNN) that is built with the help of Keras whose backend is TensorFlow in Python. The facial emotions that can be detected and classified by this system are Happy, Sad, Angry, Surprise, Fear, Disgust and Neutral.

BIN one_md_per_day_format/projects/project4/Time_series_split.png (binary file added, 61 KiB)

one_md_per_day_format/projects/project4/audit/readme.md (133 lines added)

@@ -0,0 +1,133 @@
# Financial strategies on the SP500
This document is the correction of project 4. Some steps are detailed in W1D5E4. TODO: replace with quest name
```
project
│   README.md
│   environment.yml
└───data
│   │   sp500.csv
└───results
│   │
│   │───cross-validation
│   │   │   ml_metrics_train.csv
│   │   │   metric_train.csv
│   │   │   top_10_feature_importance.csv
│   │   │   metric_train.png
│   │
│   │───selected model
│   │   │   selected_model.pkl
│   │   │   selected_model.txt
│   │   │   ml_signal.csv
│   │
│   │───strategy
│   │   │   strategy.png
│   │   │   results.csv
│   │   │   report.md
│
│───scripts (free format)
│   │   features_engineering.py
│   │   gridsearch.py
│   │   model_selection.py
│   │   create_signal.py
│   │   strategy.py
```
###### Is the structure of the project as shown above?
###### Does the readme file summarize how to run the code and explain the global approach?
###### Does the environment file contain all the libraries used, with the versions necessary to run the code?
###### Do the text files explain the chosen model methodology?
## **Data processing and feature engineering**
###### Is the data split into a train set and a test set?
###### Is the last day of the train set day D and the first day of the test set day D+n with n>0? Splitting without considering the time series structure is wrong.
##### There is no leakage: unfortunately there's no automated way to check whether the dataset is leaked. This step is validated if the features of date D are built as follows:
| Index | Features |Target |
|----------|:-------------: |------:|
| Day D-1 | Features until D-1 23:59pm | return(D, D+1) |
| Day D | Features until D 23:59pm | return(D+1, D+2) |
| Day D+1 | Features until D+1 23:59pm | return(D+2, D+3) |
###### Has the data been grouped by ticker before computing the features?
###### Has the target been grouped by ticker before computing the future returns?
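Part of this check can be scripted. A minimal sketch, assuming both sets carry the (date, ticker) MultiIndex used in this project (`train`, `test` and the level names are assumptions about the student's layout):
```python
import pandas as pd

def audit_split(train: pd.DataFrame, test: pd.DataFrame) -> None:
    # Hypothetical layout: both frames carry a (date, ticker) MultiIndex.
    last_train_day = train.index.get_level_values("date").max()
    first_test_day = test.index.get_level_values("date").min()
    # The last train day must come strictly before the first test day.
    assert last_train_day < first_test_day, "train/test sets overlap in time"
```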
## **Machine Learning pipeline**
### Cross-Validation
###### Does the CV contain at least 10 folds in total?
###### Do all train folds have more than 2 years of history? If you use a time series split, checking that the first fold has more than 2 years of history is enough.
##### The last validation set of the train set doesn't overlap with the test set.
##### None of the folds contain data from the same day. The split should be done on the dates.
##### There's a plot showing your cross-validation. As usual, all plots should have named axes and a title. If you chose a Time Series Split, the plot should look like this:
![alt text][timeseries]
[timeseries]: ../Time_series_split.png "Time Series split"
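The two date-level conditions above can also be checked with a short script. A sketch, assuming `folds` is the list of `(train_idx, val_idx)` position pairs and `dates` the date level of the dataset aligned with its rows (both names are assumptions):
```python
import pandas as pd

def audit_folds(folds, dates: pd.Series) -> None:
    for i, (train_idx, val_idx) in enumerate(folds):
        train_days = set(dates.iloc[train_idx])
        val_days = set(dates.iloc[val_idx])
        # No calendar day may sit on both sides of a fold.
        assert not train_days & val_days, f"fold {i} mixes days"
        # Every train fold must cover more than 2 years of history.
        span = max(train_days) - min(train_days)
        assert span > pd.Timedelta(days=2 * 365), f"fold {i} has less than 2y"
```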
### Model Selection
##### The test set hasn't been used to train or select the model.
###### Is the selected model saved in the pkl file and described in a txt file?
### Selected model
##### The ML metrics computed on the train set are aggregated: sum or median.
###### Are the ML metrics saved in a csv file?
###### Are the top 10 important features per fold saved in `top_10_feature_importance.csv`?
###### Does `metric_train.png` show a plot similar to the one below?
*Note that this can also be done on the test set **if** it hasn't been used to select the pipeline.*
![alt text][barplot]
[barplot]: ../metric_plot.png "Metric plot"
### Machine learning signal
##### **The pipeline shouldn't be trained once and then used to predict on all data points!** As explained: the signal has to be generated with the chosen cross-validation: train the model on the train set of the first fold, then predict on its validation set; train the model on the train set of the second fold, then predict on its validation set, etc. Then concatenate the predictions on the validation sets to build the machine learning signal.
## **Strategy backtesting**
### Convert machine learning signal into a strategy
##### The transformed machine learning signal (long only, long short, binary, ternary, stock picking, proportional to probability or custom) is multiplied by the return between d+1 and d+2. As a reminder, the signal at date d predicts whether the return between d+1 and d+2 is increasing or decreasing. The PnL of date d could then be associated with date d, d+1 or d+2. This is arbitrary and shouldn't impact the value of the PnL.
##### You invest the same amount of money every day. One exception: if you invest 1$ per day per stock, the amount invested every day may change depending on the strategy chosen. If you take into account the different amounts of capital invested every day in the calculation of the PnL, the step is still validated.
### Metrics and plot
###### Is the PnL computed as: strategy * future_return?
###### Does the strategy give the amount invested at time t on asset i?
###### Does the plot `strategy.png` contain an x axis: date?
###### Does the plot `strategy.png` contain a y axis1: PnL of the strategy at time t?
###### Does the plot `strategy.png` contain a y axis2: PnL of the SP500 at time t?
###### Does the plot `strategy.png` use the same scale for y axis1 and y axis2?
###### Does the plot `strategy.png` contain a vertical line that shows the separation between the train set and the test set?
### Report
###### Does the report detail the features used?
###### Does the report detail the pipeline used (imputer, scaler, dimension reduction and model)?
###### Does the report detail the cross-validation used (length of train sets and validation sets and, if possible, the cross-validation plot)?
###### Does the report detail the strategy chosen (description, PnL plot and the strategy metrics on the train set and test set)?

BIN one_md_per_day_format/projects/project4/blocking_time_series_split.png (binary file added, 68 KiB)

BIN one_md_per_day_format/projects/project4/metric_plot.png (binary file added, 56 KiB)

one_md_per_day_format/projects/project4/readme.md (214 lines added)

@@ -0,0 +1,214 @@
# Financial strategies on the SP500
TODO: data delivery and choose train/test split date.
In this project we will apply machine learning to finance. You are a Quant/Data Scientist and your goal is to create a financial strategy, based on a signal output by a machine learning model, that outperforms the [SP500](https://en.wikipedia.org/wiki/S%26P_500).
The Standard & Poor's 500 Index is a collection of stocks intended to reflect the overall return characteristics of the stock market as a whole. The stocks that make up the S&P 500 are selected by market capitalization, liquidity, and industry. Companies to be included in the S&P 500 are selected by the S&P 500 Index Committee, which consists of a group of analysts employed by Standard & Poor's.
The S&P 500 Index originally began in 1926 as the "composite index", composed of only 90 stocks. According to historical records, the average annual return from its inception in 1926 through 2018 is approximately 10%–11%. The average annual return since the adoption of 500 stocks into the index in 1957 through 2018 is roughly 8%.
As a Quant Researcher, you may beat the SP500 one year or even a few years. The real challenge though is to beat the SP500 consistently over decades. That's what most hedge funds in the world are trying to do.
The project is divided into three parts:
- **Data processing and feature engineering**: Build a dataset: insightful features and the target
- **Machine Learning pipeline**: Train machine learning models on the dataset, select the best model and generate the machine learning signal.
- **Strategy backtesting**: Generate a strategy from the machine learning model output and backtest the strategy. As a reminder, the idea here is to see how the strategy would have performed if you had invested.
## Deliverables
Do not forget to check the resources of W1D5 and especially W1D5E4.
TODO: replace with quest name and exercise number
### Data processing and features engineering
- Split the data into train and test (TODO: choose the year - once the data is delivered)
- Your first priority is to build a dataset without leakage!!! NO LEAKAGE!!!
**"No leakage" small guide:**
We assume it is day D and we want to take a position over the next h days, starting on the next day: the position starts on day D+1 (included). To decide whether we take a short or a long position, the return between day D+1 and D+2 is computed and used as a target. Finally, as the features on day D contain information until day D 11:59pm, the target needs to be shifted. As a result, the final dataframe schema is:
| Index | Features |Target |
|----------|:-------------: |------:|
| Day D-1 | Features until D-1 23:59pm | return(D, D+1) |
| Day D | Features until D 23:59pm | return(D+1, D+2) |
| Day D+1 | Features until D+1 23:59pm | return(D+2, D+3) |
**Note: This table is simplified, the index of your DataFrame is a multi-index with date and ticker.**
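To make the alignment concrete, here is a minimal sketch of the target construction (the `close` column and the `ticker` level name are assumptions about your data layout; rows are taken to be contiguous trading days per ticker):
```python
import numpy as np
import pandas as pd

def build_target(df: pd.DataFrame) -> pd.Series:
    # pct_change per ticker puts return(D, D+1) on row D+1.
    daily_return = df.groupby(level="ticker")["close"].pct_change()
    # Shifting back by 2 rows per ticker puts return(D+1, D+2) on row D,
    # so row D's features (known at D 23:59pm) face a strictly future return.
    future_return = daily_return.groupby(level="ticker").shift(-2)
    # Classification target: does the price increase between D+1 and D+2?
    return np.sign(future_return)
```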
- Features:
- Bollinger
- RSI
- MACD
**Note: you can use any library to compute these features, you don't need to implement all financial features from scratch.**
- Target:
- On day D, the target is: **sign(return(D+1, D+2))**
> Remark: The target used is the return computed on the price and not the price directly. There are statistical reasons for this choice - the price is not stationary. The consequence is that a machine learning model tends to overfit while training on non-stationary data.
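As for the three features listed above, here is a plain-pandas sketch computed per ticker (a dedicated library such as `ta` would work just as well; the `close` column and `ticker` level name are assumptions):
```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    grouped = out.groupby(level="ticker")["close"]

    # Bollinger position: distance of the close from its 20-day mean,
    # expressed in units of two rolling standard deviations.
    mean20 = grouped.transform(lambda s: s.rolling(20).mean())
    std20 = grouped.transform(lambda s: s.rolling(20).std())
    out["bollinger"] = (out["close"] - mean20) / (2 * std20)

    # RSI(14): ratio of average gains to average losses over 14 days.
    def rsi(s: pd.Series, window: int = 14) -> pd.Series:
        delta = s.diff()
        gain = delta.clip(lower=0).rolling(window).mean()
        loss = (-delta.clip(upper=0)).rolling(window).mean()
        return 100 - 100 / (1 + gain / loss)

    out["rsi"] = grouped.transform(rsi)

    # MACD: 12-day EMA minus 26-day EMA.
    out["macd"] = grouped.transform(
        lambda s: s.ewm(span=12, adjust=False).mean()
        - s.ewm(span=26, adjust=False).mean()
    )
    return out
```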
### Machine learning pipeline
- Cross-validation deliverables:
- Implement a cross-validation with at least 10 folds. Each train fold has to cover more than 2 years of history.
- Two types of temporal cross-validations are required:
- Blocking (plot below)
- Time Series split (plot below)
- Make sure the last fold of the train set does not overlap with the test set.
- Make sure the folds do not contain data from the same day. The data should be split on the dates.
- Plot your cross-validation as follows:
![alt text][blocking]
[blocking]: blocking_time_series_split.png 'Blocking Time Series split'
![alt text][timeseries]
[timeseries]: Time_series_split.png 'Time Series split'
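One way to build both splitters is to split on unique days first and map each fold back to row positions, which guarantees that no day lands on both sides. A sketch, using `sklearn`'s `TimeSeriesSplit` for the time-series variant and a hand-written class for the blocking variant (`dates` and the fraction parameter are assumptions):
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

def date_level_folds(dates: pd.Series, n_splits: int = 10):
    """Split on unique days, then map each fold back to row positions."""
    unique_days = np.sort(dates.unique())
    for train_days, val_days in TimeSeriesSplit(n_splits=n_splits).split(unique_days):
        yield (np.where(dates.isin(unique_days[train_days]))[0],
               np.where(dates.isin(unique_days[val_days]))[0])

class BlockingTimeSeriesSplit:
    """Contiguous, non-overlapping blocks, each cut into train/validation."""
    def __init__(self, n_splits: int = 10, train_frac: float = 0.8):
        self.n_splits, self.train_frac = n_splits, train_frac

    def split(self, X):
        n = len(X)
        block = n // self.n_splits
        for i in range(self.n_splits):
            start, stop = i * block, (i + 1) * block
            cut = start + int(self.train_frac * (stop - start))
            yield np.arange(start, cut), np.arange(cut, stop)
```
The blocking variant keeps the folds disjoint; the time-series variant grows the train set at each fold.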
Once you have run the grid search on the cross-validation (choose either Blocking or Time Series split), you'll select the best pipeline on the train set and save it as `selected_model.pkl` and `selected_model.txt` (pipeline hyper-parameters).
**Note: You may observe that the selected model is not good after analyzing the ML metrics (ON THE TRAIN SET) and select another one.**
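A sketch of this selection step (the pipeline composition and the grid are placeholders, not the required ones; the `results/selected model/` paths follow the repository structure at the end of this document):
```python
import pickle
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def select_and_save(X_train, y_train, folds):
    pipe = Pipeline([
        ("imputer", SimpleImputer()),
        ("scaler", StandardScaler()),
        ("model", GradientBoostingClassifier()),
    ])
    grid = GridSearchCV(
        pipe,
        param_grid={"model__n_estimators": [50, 100], "model__max_depth": [2, 3]},
        scoring="roc_auc",
        cv=folds,  # list of (train_idx, val_idx) pairs from the chosen splitter
    )
    grid.fit(X_train, y_train)
    # Persist the winning pipeline and a readable trace of its hyper-parameters.
    with open("results/selected model/selected_model.pkl", "wb") as f:
        pickle.dump(grid.best_estimator_, f)
    with open("results/selected model/selected_model.txt", "w") as f:
        f.write(str(grid.best_params_))
    return grid.best_estimator_
```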
- ML metrics and feature importances for the selected pipeline, on the train set only.
- DataFrame with the machine learning metrics on the train and validation sets for all folds of the train set. Suggested format: columns: ML metrics (AUC, Accuracy, LogLoss); rows: folds, train set and validation set (double index). Save it as `ml_metrics_train.csv`.
- Plot. Choose the metric you want. Suggested: AUC. Save it as `metric_train.png`. The plot below shows what it should look like.
- DataFrame with the top 10 important features for each fold. Save it as `top_10_feature_importance.csv`. A sketch of the per-fold metrics follows the plot below.
![alt text][barplot]
[barplot]: metric_plot.png 'Metric plot'
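A sketch of the per-fold metrics deliverable (`pipe`, `X`, `y` and `folds` are assumptions matching the steps above; a binary up/down target is assumed):
```python
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

def fold_metrics(pipe, X, y, folds) -> pd.DataFrame:
    rows = {}
    for i, (train_idx, val_idx) in enumerate(folds):
        model = clone(pipe).fit(X.iloc[train_idx], y.iloc[train_idx])
        for part, idx in (("train", train_idx), ("validation", val_idx)):
            proba = model.predict_proba(X.iloc[idx])[:, 1]
            rows[(f"fold_{i}", part)] = {
                "AUC": roc_auc_score(y.iloc[idx], proba),
                "Accuracy": accuracy_score(y.iloc[idx], model.predict(X.iloc[idx])),
                "LogLoss": log_loss(y.iloc[idx], proba),
            }
    # Double index (fold, set) as rows, one column per ML metric.
    return pd.DataFrame(rows).T

# metrics = fold_metrics(pipe, X, y, folds)
# metrics.to_csv("results/cross-validation/ml_metrics_train.csv")
# metrics["AUC"].unstack().plot.bar()  # starting point for metric_train.png
```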
- The signal has to be generated with the chosen cross-validation: train the model on the train set of the first fold, then predict on its validation set; train the model on the train set of the second fold, then predict on its validation set, etc. Then, concatenate the predictions on the validation sets to build the machine learning signal. **The pipeline shouldn't be trained once and then used to predict on all data points!** See the sketch after this list.
**The output is a DataFrame or Series with an ordered double index, giving the probability that the stock price of asset i increases between d+1 and d+2.**
- (optional): [Train a RNN/LSTM](https://towardsdatascience.com/predicting-stock-price-with-lstm-13af86a74944). This is a nice way to discover and learn about recurrent neural networks. But keep in mind that some newer neural network architectures seem to outperform recurrent neural networks: https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0.
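A sketch of the signal generation loop described above (names are assumptions; the point is that every prediction is out-of-sample):
```python
import pandas as pd
from sklearn.base import clone

def make_signal(pipe, X, y, folds) -> pd.Series:
    parts = []
    for train_idx, val_idx in folds:
        # Refit on the fold's train set, predict only on its validation set.
        model = clone(pipe).fit(X.iloc[train_idx], y.iloc[train_idx])
        proba = model.predict_proba(X.iloc[val_idx])[:, 1]
        parts.append(pd.Series(proba, index=X.index[val_idx]))
    # The concatenated validation predictions are the machine learning signal.
    return pd.concat(parts).sort_index()

# make_signal(pipe, X, y, folds).to_csv("results/selected model/ml_signal.csv")
```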
## Strategy backtesting
- Backtesting module deliverables. The module takes a machine learning signal as input and converts it into a financial strategy. A financial strategy DataFrame gives the amount invested at time t on asset i. The module returns the following metrics on the train set and the test set (see the sketch after this list).
- PnL plot: save it as `strategy.png`
- x axis: date
- y axis1: PnL of the strategy at time t
- y axis2: PnL of the SP500 at time t
- Use the same scale for y axis1 and y axis2
- add a line that shows the separation between train set and test set
- PnL
- Max drawdown: https://www.investopedia.com/terms/d/drawdown.asp
- (Optional): add other metrics such as Sharpe ratio, volatility, etc.
- Create a markdown report that explains the following and save it as `report.md`:
- the features used
- the pipeline used
- imputer
- scaler
- dimension reduction
- model
- the cross-validation used
- length of train sets and validation sets
- cross-validation plot (optional)
- strategy chosen
- description
- PnL plot
- strategy metrics on the train set and test set
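A minimal sketch of the backtesting arithmetic (the `date` level name and the `strategy` and `future_return` series are assumptions matching the conventions above):
```python
import pandas as pd

def cumulative_pnl(strategy: pd.Series, future_return: pd.Series) -> pd.Series:
    # Amount invested per (date, ticker) times the matching return,
    # summed over tickers, then accumulated through time.
    return (strategy * future_return).groupby(level="date").sum().cumsum()

def max_drawdown(cum_pnl: pd.Series) -> float:
    # Largest drop from a running peak of the cumulative PnL.
    return float((cum_pnl.cummax() - cum_pnl).max())

# cum = cumulative_pnl(strategy, future_return)
# ax = cum.plot(title="PnL")   # starting point for strategy.png
# ax.axvline(split_date)       # hypothetical train/test boundary
```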
### Examples of strategies:
- Long only:
- Binary signal:
  - 0: do nothing for one day on asset i
  - 1: take a long position on asset i for 1 day
- Weights proportional to the machine learning signal:
  - invest x on asset i for one day
- Long and short: for those who search for "long short strategy" on Google, don't get it wrong, this has nothing to do with pair trading.
- Binary signal:
- -1: take a short position on asset i for 1 day
- 1: take a long position on asset i for 1 day
- Ternary signal:
- -1: take a short position on asset i for 1 day
- 0: do nothing for one day on asset i
- 1: take a long position on asset i for 1 day
Notes:
- Warning! When you don't invest in all stocks, as with the binary or ternary signal, make sure that you are still investing 1$ per day!
- In order to simplify the **short position** we consider that it is the opposite of a long position. Example: I short one AAPL stock and the price decreases by 20$ in one day. I earn 20$.
- Stock picking: take a long position on the k best assets (according to the machine learning signal) and short the k worst assets.
Here's an example of how to convert a machine learning signal into a financial strategy:
- Input:
| Date | Ticker|Machine Learning signal |
|--------|:----: |-----------:|
| Day D-1| AAPL | 0.55 |
| Day D-1| C | 0.36 |
| Day D | AAPL | 0.59 |
| Day D | C | 0.33 |
| Day D+1| AAPL | 0.61 |
| Day D+1| C | 0.33 |
- Convert it into a binary long only strategy:
- Machine learning signal > 0.5
| Date | Ticker|Binary signal |
|--------|:----: |-----------:|
| Day D-1| AAPL | 1 |
| Day D-1| C | 0 |
| Day D | AAPL | 1 |
| Day D | C | 0 |
| Day D+1| AAPL | 1 |
| Day D+1| C | 0 |
!!! BE CAREFUL !!! THIS IS EXTREMELY IMPORTANT.
- Multiply it by the associated return.
Don't forget the meaning of the signal on day d: it gives information about the return between d+1 and d+2. You should multiply the binary signal of day d by the return computed between d+1 and d+2. Otherwise it's wrong, because you would be using a signal that carries information about d+1 and d+2 on the past or present. The strategy would be leaked!
**Assumption**: you have 1$ per day to invest in your strategy.
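Under that assumption, the conversion above can be sketched as follows (the 0.5 threshold is the one from the example; the daily renormalization enforces the 1$-per-day budget, assuming the (date, ticker) double index used throughout):
```python
import pandas as pd

def binary_long_only(signal: pd.Series) -> pd.Series:
    # 1 above 0.5, 0 otherwise, per (date, ticker).
    position = (signal > 0.5).astype(float)
    # Rescale so the amounts invested on any single day sum to 1$.
    invested = position.groupby(level="date").transform("sum")
    return position / invested.replace(0, 1)
```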
## Project repository structure:
```
project
│   README.md
│   environment.yml
└───data
│   │   sp500.csv
└───results
│   │
│   │───cross-validation
│   │   │   ml_metrics_train.csv
│   │   │   metric_train.csv
│   │   │   top_10_feature_importance.csv
│   │   │   metric_train.png
│   │
│   │───selected model
│   │   │   selected_model.pkl
│   │   │   selected_model.txt
│   │   │   ml_signal.csv
│   │
│   │───strategy
│   │   │   strategy.png
│   │   │   results.csv
│   │   │   report.md
│
│───scripts (free format)
│   │   features_engineering.py
│   │   gridsearch.py
│   │   model_selection.py
│   │   create_signal.py
│   │   strategy.py
```
Note: `features_engineering.py` can be used in `gridsearch.py`