From 5b74816cc86b53c7e349ec1fd7ff546934938ac8 Mon Sep 17 00:00:00 2001 From: Badr Ghazlane Date: Mon, 27 Sep 2021 00:46:22 +0200 Subject: [PATCH] fix: clean description of days 3 and 4 week 2 --- .../week01/day03/ex01/audit/readme.md | 5 ++ .../week01/day03/ex05/audit/readme.md | 2 +- .../week01/day03/ex07/readme.md | 2 +- one_exercise_per_file/week02/day03/readme.md | 48 ++++++++++++++----- one_exercise_per_file/week02/day04/readme.md | 48 +++++++++++++------ 5 files changed, 76 insertions(+), 29 deletions(-) diff --git a/one_exercise_per_file/week01/day03/ex01/audit/readme.md b/one_exercise_per_file/week01/day03/ex01/audit/readme.md index e69de29..0b234ba 100644 --- a/one_exercise_per_file/week01/day03/ex01/audit/readme.md +++ b/one_exercise_per_file/week01/day03/ex01/audit/readme.md @@ -0,0 +1,5 @@ +1. This question is validated if the plot reproduces the plot in the image. It has to contain a title, an x-axis name and a legend. + +![alt text][logo] + +[logo]: ../images/w1day03_ex1_plot1.png "Bar plot ex1" \ No newline at end of file diff --git a/one_exercise_per_file/week01/day03/ex05/audit/readme.md b/one_exercise_per_file/week01/day03/ex05/audit/readme.md index 5fccccc..be1f21b 100644 --- a/one_exercise_per_file/week01/day03/ex05/audit/readme.md +++ b/one_exercise_per_file/week01/day03/ex05/audit/readme.md @@ -11,6 +11,6 @@ The plot has to contain: ![alt text][logo_ex5] -[logo_ex5]: images/day03/w1day03_ex5_plot1.png "Subplots ex5" +[logo_ex5]: ../images/w1day03_ex5_plot1.png "Subplots ex5" Check that the plot has been created with a for loop. 
\ No newline at end of file diff --git a/one_exercise_per_file/week01/day03/ex07/readme.md b/one_exercise_per_file/week01/day03/ex07/readme.md index 21e781a..c3586b5 100644 --- a/one_exercise_per_file/week01/day03/ex07/readme.md +++ b/one_exercise_per_file/week01/day03/ex07/readme.md @@ -14,7 +14,7 @@ y2 = np.random.randn(50) + 2 ![alt text][logo_ex7] -[logo_ex7]: images/day03/w1day03_ex7_plot1.png "Box plot ex7" +[logo_ex7]: images/w1day03_ex7_plot1.png "Box plot ex7" The plot has to contain: diff --git a/one_exercise_per_file/week02/day03/readme.md b/one_exercise_per_file/week02/day03/readme.md index 3aa7898..1a5e16a 100644 --- a/one_exercise_per_file/week02/day03/readme.md +++ b/one_exercise_per_file/week02/day03/readme.md @@ -1,8 +1,6 @@ -# W2D03 Piscine AI - Data Science +# W2D03 Piscine AI - Data Science -# Table of Contents: - -# Introduction +## Machine Learning Pipeline Today we will focus on data preprocessing and discover the Pipeline object from Scikit-learn. @@ -13,23 +11,47 @@ Today we will focus on the data preprocessing and discover the Pipeline object f - The **step 1** is always necessary. Models use numbers; string data, for instance, can't be processed raw. - The **step 2** is always necessary. Machine learning models use numbers, and missing values have no mathematical representation; that is why the missing values have to be imputed. -- The **step 3** is required when the dimension of the data set is high. The dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that contain most of the information (SelectKBest) or by transforming the data. Depending on the signal in the data and the data set size the dimension reduction is not always required. This step is not covered because of its complexity. The understanding of the theory behind is important. However, I suggest to give it a try during the projects. This article gives an introduction. 
- -- https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e +- The **step 3** is required when the dimension of the data set is high. The dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that contain most of the information (SelectKBest) or by transforming the data. Depending on the signal in the data and the size of the data set, dimension reduction is not always required. This step is not covered here because of its complexity, but understanding the theory behind it is important, and I suggest giving it a try during the projects. -- The **step 4** is required when using some type of Machine Learning algorithms. The Machine Learning algorithms that require the feature scaling are mostly KNN (K-Nearest Neighbors), Neural Networks, Linear Regression, and Logistic Regression. The reason why some algorithms work better with feature scaling is that the minimization of the loss function may be more difficult if each feature's range is completely different. More details: - -- https://medium.com/@societyofai/simplest-way-for-feature-scaling-in-gradient-descent-ae0aaa383039#:~:text=Feature%20scaling%20is%20an%20idea,of%20convergence%20of%20gradient%20descent. +- The **step 4** is required when using some types of Machine Learning algorithms. The Machine Learning algorithms that require feature scaling are mostly KNN (K-Nearest Neighbors), Neural Networks, Linear Regression, and Logistic Regression. The reason why some algorithms work better with feature scaling is that the minimization of the loss function may be more difficult if each feature's range is completely different. These steps are sequential: the output of step 1 is used as input for step 2 and so on, and the output of step 4 is used as input for the Machine Learning model. Scikit-learn provides an object for this: Pipeline. As we know, the model evaluation methodology requires splitting the data set into a train set and a test set. 
**The preprocessing is learned/fitted on the training set and applied on the test set**. -- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html - This object takes as input the preprocessing transforms and a Machine Learning model. Then this object can be called the same way a Machine Learning model is called. This is quite practical because we no longer need to carry many objects around. +## Exercises of the day + +- Exercise 1 Imputer 1 +- Exercise 2 Scaler +- Exercise 3 One hot Encoder +- Exercise 4 Ordinal Encoder +- Exercise 5 Categorical variables +- Exercise 6 Pipeline + + +## Virtual Environment +- Python 3.x +- NumPy +- Pandas +- Matplotlib +- Scikit Learn +- Jupyter or JupyterLab + +*Version of Scikit Learn I used to do the exercises: 0.22*. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years. + ## Resources -TODO \ No newline at end of file +### Step 3 + +- https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e + +### Step 4 + +- https://medium.com/@societyofai/simplest-way-for-feature-scaling-in-gradient-descent-ae0aaa383039#:~:text=Feature%20scaling%20is%20an%20idea,of%20convergence%20of%20gradient%20descent. + +### Pipeline + +- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html \ No newline at end of file diff --git a/one_exercise_per_file/week02/day04/readme.md b/one_exercise_per_file/week02/day04/readme.md index 830e5b2..e51418c 100644 --- a/one_exercise_per_file/week02/day04/readme.md +++ b/one_exercise_per_file/week02/day04/readme.md @@ -1,10 +1,8 @@ -# D04 Piscine AI - Data Science +# W2D04 Piscine AI - Data Science -# Table of Contents: +## Train and evaluate Machine Learning models -# Introduction - -Today we will learn how to choose the right Machine Learning metric depending on the problem you are solving and to compute it. A metric gives an idea of how good the model performs. 
Depending on working on a classification problem or a regression problem the metrics considered are different. It is important to understand that all metrics are just metrics, not the truth. +Today we will learn how to train and evaluate a machine learning model. You'll learn how to choose the right Machine Learning metric depending on the problem you are solving, and how to compute it. A metric gives an idea of how well the model performs. Depending on whether you are working on a classification problem or a regression problem, the metrics considered are different. It is important to understand that all metrics are just metrics, not the truth. We will focus on the most important metrics: @@ -15,20 +13,42 @@ We will focus on the most important metrics: Warning: **Imbalanced data set** -Let us assume we are predicting a rare event that occurs less than 2% of the time. Having a model that scores a good accuracy is easy, it doesn't have to be "smart", all it has to do is to always predict the majority class. Depending on the problem it can be disastrous. For example, working with real life data, breast cancer prediction is an imbalanced problem where predicting the majority leads to disastrous consequences. That is why metrics as AUC are useful. +Let us assume we are predicting a rare event that occurs less than 2% of the time. Having a model that scores a good accuracy is easy: it doesn't have to be "smart", all it has to do is always predict the majority class. Depending on the problem, this can be disastrous. For example, working with real-life data, breast cancer prediction is an imbalanced problem where predicting the majority class leads to disastrous consequences. That is why metrics such as AUC are useful. Before computing the metrics, read this article carefully to understand the role of these metrics. 
-- https://stats.stackexchange.com/questions/260164/auc-and-class-imbalance-in-training-test-dataset +You'll learn to train Machine Learning models other than linear regression and logistic regression. You're not supposed to spend time understanding the theory; I recommend doing that during the projects. Today, read the Scikit-learn documentation to get a basic understanding of the models you use, and focus on how to use those Machine Learning models correctly with Scikit-learn. -Before to compute the metrics, read carefully this article to understand the role of these metrics. +You'll also learn what a grid search is and how to use it to train your machine learning models. -- https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-models-2.html +## Exercises of the day + +- Exercise 1 MSE Scikit-learn +- Exercise 2 Accuracy Scikit-learn +- Exercise 3 Regression +- Exercise 4 Classification +- Exercise 5 Machine Learning models +- Exercise 6 Grid Search + + +## Virtual Environment +- Python 3.x +- NumPy +- Pandas +- Matplotlib +- Scikit Learn +- Jupyter or JupyterLab -+ ML models + GS +*Version of Scikit Learn I used to do the exercises: 0.22*. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years. -## Historical ## Resources -## Rules +### Metrics + +- https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-models-2.html + +### Imbalanced datasets + +- https://stats.stackexchange.com/questions/260164/auc-and-class-imbalance-in-training-test-dataset -## Resources ### Grid search -- https://scikit-learn.org/stable/modules/model_evaluation.html +- https://medium.com/fintechexplained/what-is-grid-search-c01fe886ef0a \ No newline at end of file
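As a rough sketch of how the pieces described in these two readmes fit together — a Pipeline chaining imputation (step 2) and scaling (step 4) into a model, trained with a grid search — the following is illustrative only; the toy data, the KNN model choice, and the parameter grid are made up, not part of the exercises:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with some missing values injected, so the imputation step matters.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[::7, 0] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing steps chained with a model: one object instead of many.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # step 2: impute missing values
    ("scaler", StandardScaler()),                   # step 4: KNN needs feature scaling
    ("model", KNeighborsClassifier()),
])

# The grid search refits the whole pipeline on each fold, so the imputer and
# scaler are learned on the training folds only, never on the validation fold.
grid = GridSearchCV(pipe, {"model__n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```

Note that because the preprocessing lives inside the pipeline, calling `grid.score(X_test, y_test)` automatically applies the imputer and scaler fitted on the training set to the test set, which is exactly the rule stated above.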