
Merge pull request #8 from 01-edu/day-03-week02-testing

week2 - day3: testing and feedback
one_md_per_day_format/piscine/Week2/day03.md

# W2D03 Piscine AI - Data Science
# Table of Contents:
# Introduction
Today we will focus on data preprocessing and discover the `Pipeline` object from scikit-learn.
1. Manage categorical variables with Integer encoding and One Hot Encoding
2. Impute the missing values
3. Reduce the dimension of the data
4. Scale the data
- **Step 1** is always necessary. Models work with numbers; string data, for instance, can't be processed raw.
- **Step 2** is always necessary. Machine learning models work with numbers, and missing values have no mathematical representation, which is why they have to be imputed.
- **Step 3** is required when the dimension of the data set is high. Dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that contain most of the information (`SelectKBest`) or by transforming the data. Depending on the signal in the data and the size of the data set, dimension reduction is not always required. This step is not covered here because of its complexity, but understanding the theory behind it is important and I suggest giving it a try during the projects. This article gives an introduction: https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e
- **Step 4** is required when using some types of Machine Learning algorithms, mostly KNN (K-Nearest Neighbors), Neural Networks, Linear Regression, and Logistic Regression. The reason some algorithms work better with feature scaling is that minimizing the loss function may be more difficult if the ranges of the features are completely different. More details: https://medium.com/@societyofai/simplest-way-for-feature-scaling-in-gradient-descent-ae0aaa383039#:~:text=Feature%20scaling%20is%20an%20idea,of%20convergence%20of%20gradient%20descent.
These steps are sequential: the output of step 1 is used as input for step 2 and so on, and the output of step 4 is used as input for the Machine Learning model.

Scikit-learn provides an object that chains these steps: `Pipeline`.

As we know, the model evaluation methodology requires splitting the data set into a train set and a test set. **The preprocessing is learned/fitted on the training set and applied to the test set.**
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
This object takes the preprocessing transformers and a Machine Learning model as input. It can then be called the same way a Machine Learning model is called, which is convenient because we no longer need to carry many objects around.
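As a minimal sketch of the idea (with a hypothetical tiny train/test split), the preprocessing is fitted on the train set and only applied to the test set:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical tiny train and test sets
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
X_test = np.array([[2.0, np.nan]])

preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),  # step 2: impute missing values
    ('scaler', StandardScaler()),                 # step 4: scale the features
])

X_train_ready = preprocessing.fit_transform(X_train)  # learned/fitted on the train set
X_test_ready = preprocessing.transform(X_test)        # only applied to the test set
```

A model appended as a last step would turn this preprocessing pipeline into a full predictor, as Exercise 6 shows.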
## Resources
TODO
# Exercise 1 Imputer 1
The goal of this exercise is to learn how to use an Imputer to fill missing values on a basic example.
```python
import numpy as np

train_data = [[7, 6, 5],
              [4, np.nan, 5],
              [1, 20, 8]]
```
1. Fit the `SimpleImputer` on the data. Print the `statistics_`. Check that the statistics match `np.nanmean(train_data, axis=0)`.
2. Fill the missing values in `train_data` using the fitted imputer and `transform`.
3. Fill the missing values in `test_data` using the fitted imputer and `transform` (a sketch follows the code block below).

```python
test_data = [[np.nan, 1, 2],
             [7, np.nan, 9],
             [np.nan, 2, 4]]
```
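A minimal sketch of the three questions, assuming the `train_data` and `test_data` defined above:

```python
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(strategy='mean')  # mean is the default strategy, shown for clarity
imp_mean.fit(train_data)                   # learns the column means on the train set
print(imp_mean.statistics_)                # matches np.nanmean(train_data, axis=0)

print(imp_mean.transform(train_data))      # question 2: fill the train set
print(imp_mean.transform(test_data))       # question 3: fill the test set
```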
## Correction

1. This question is validated if `imp_mean.statistics_` returns:

```console
array([ 4., 13.,  6.])
```

2. This question is validated if the filled train set is:

```console
array([[ 7.,  6.,  5.],
       [ 4., 13.,  5.],
       [ 1., 20.,  8.]])
```

3. This question is validated if the filled test set is:

```console
array([[ 4.,  1.,  2.],
       [ 7., 13.,  9.],
       [ 4.,  2.,  4.]])
```
# Exercise 2 Scaler
The goal of this exercise is to learn to scale a data set. There are various scaling techniques; we will focus on `StandardScaler` from scikit-learn.

We will use a tiny data set that we will generate ourselves:

```python
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
```
1. Fit the `StandardScaler` on the data and scale X_train using `fit_transform`. Compute the `mean` and `std` on `axis 0`.
2. Scale the test set using the `StandardScaler` fitted on the train set (a sketch follows the code block below).

```python
X_test = np.array([[ 2., -1.,  1.],
                   [ 3.,  3., -1.],
                   [ 1.,  1.,  1.]])
```
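A minimal sketch of both questions, assuming the `X_train` and `X_test` defined above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the train set only

print(X_train_scaled.mean(axis=0))  # expected: [0., 0., 0.]
print(X_train_scaled.std(axis=0))   # expected: [1., 1., 1.]

print(scaler.transform(X_test))     # reuse the train statistics on the test set
```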
**WARNING:
If the data is split in train and test set, it is extremely important to apply the same scaling to the test data. As the model is trained on scaled data, if it takes unscaled data as input, it returns incorrect values.**
Resources:

- https://medium.com/technofunnel/what-when-why-feature-scaling-for-machine-learning-standard-minmax-scaler-49e64c510422
- https://scikit-learn.org/stable/modules/preprocessing.html
## Correction
1. This question is validated if the scaled train set is:
```console
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
```

- The mean on axis 0 should return: `array([0., 0., 0.])`
- The std on axis 0 should return: `array([1., 1., 1.])`

2. This question is validated if the scaled test set is:

```console
array([[ 1.22474487, -1.22474487,  0.53452248],
       [ 2.44948974,  3.67423461, -1.06904497],
       [ 0.        ,  1.22474487,  0.53452248]])
```
# Exercise 3 One Hot Encoder

The goal of this exercise is to learn how to deal with categorical variables using the One Hot Encoder.

```python
X_train = [['Python'], ['Java'], ['Java'], ['C++']]
```
1. Using `OneHotEncoder` with `handle_unknown='ignore'`, fit the One Hot Encoder and transform X_train. The expected output is:
| | ('C++',) | ('Java',) | ('Python',) |
|---:|-----------:|------------:|--------------:|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 |
To get this output, create a DataFrame from the transformed X_train and the attribute `categories_`.
2. Transform X_test using the One Hot Encoder fitted on the train set (a sketch of both questions follows the expected output below).

```python
X_test = [['Python'], ['Java'], ['C'], ['C++']]
```
The expected output is:
| | ('C++',) | ('Java',) | ('Python',) |
|---:|-----------:|------------:|--------------:|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 |
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
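A minimal sketch of both questions, assuming the `X_train` and `X_test` defined above:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
transformed = ohe.fit_transform(X_train).toarray()  # sparse by default, densify for display

# zipping the learned categories yields tuple column labels like in the expected output
df = pd.DataFrame(transformed, columns=list(zip(*ohe.categories_)))
print(df)

# 'C' was never seen during fit, so handle_unknown='ignore' encodes it as all zeros
print(ohe.transform(X_test).toarray())
```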
## Correction

1. This question is validated if the output is:
| | ('C++',) | ('Java',) | ('Python',) |
|---:|-----------:|------------:|--------------:|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 |
2. This question is validated if the output is:
| | ('C++',) | ('Java',) | ('Python',) |
|---:|-----------:|------------:|--------------:|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 |
# Exercise 4 Ordinal Encoder
The goal of this exercise is to learn how to deal with categorical variables using the Ordinal Encoder.

In that case, we want the model to consider that: **good > neutral > bad**
```python
X_train = [['good'], ['bad'], ['neutral']]
```
1. Fit the `OrdinalEncoder` by specifying the categories in the following order: `categories=[['bad', 'neutral', 'good']]`. Transform the train set. Print the `categories_`.
2. Transform X_test using the Ordinal Encoder fitted on the train set (a sketch follows the code block below).

```python
X_test = [['good'], ['good'], ['bad']]
```
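A minimal sketch of both questions, assuming the `X_train` and `X_test` defined above:

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(categories=[['bad', 'neutral', 'good']])
print(enc.fit_transform(X_train))  # bad -> 0., neutral -> 1., good -> 2.
print(enc.categories_)

print(enc.transform(X_test))       # reuse the order learned on the train set
```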
*Note: In version 0.22 of Scikit-learn, the Ordinal Encoder doesn't handle new values in the test set, but this will be possible in version 0.24!*

## Correction

1. This question is validated if the output of the Ordinal Encoder on the train set is:

```console
array([[2.],
       [0.],
       [1.]])
```
Check that `enc.categories_` returns `[array(['bad', 'neutral', 'good'], dtype=object)]`.
2. This question is validated if the output of the Ordinal Encoder on the test set is:

```console
array([[2.],
       [2.],
       [0.]])
```
# Exercise 5 Categorical variables

The goal of this exercise is to learn how to deal with categorical variables with the Ordinal Encoder, Label Encoder and One Hot Encoder.
Preliminary:
- Load the breast-cancer.csv file
- Drop the `Class` column
- Drop the NaN values
- Split the data in a train set and test set (test set size = 20% of the total size) with `random_state=43` (a sketch of these steps follows below)
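A minimal sketch of the preliminary steps, assuming `breast-cancer.csv` is in the working directory:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('breast-cancer.csv')      # assumed file location
df = df.drop(columns=['Class']).dropna()   # drop the Class column and the missing values

X_train, X_test = train_test_split(df, test_size=0.2, random_state=43)
```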
1. Count the number of unique values per feature in the train set.
2. Identify the ordinal variables, the nominal variables and the target. Create one One Hot Encoder for all categorical features (excluding the ordinal ones). Here are the assumptions made on the variables:
```console
age: Ordinal
['ge40' > 'premeno' > 'lt40']
breast-quad: One Hot
irradiat: One Hot
['recurrence-events' 'no-recurrence-events']
```
- Fit on the train set
- Transform the test set

Example of expected output (a sketch follows the block below):

```console
# One Hot encoder on: ['inv-nodes', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
input: ohe.transform(df[ohe_cols])
array(['inv-nodes_no', 'inv-nodes_yes', 'deg-malig_left',
```
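A minimal sketch for this question, assuming the train/test split from the preliminary step and the nominal columns listed in the expected output above:

```python
from sklearn.preprocessing import OneHotEncoder

# hypothetical list of nominal columns, taken from the expected output above
ohe_cols = ['inv-nodes', 'deg-malig', 'breast', 'breast-quad', 'irradiat']

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(X_train[ohe_cols])                          # fit on the train set only
print(ohe.transform(X_test[ohe_cols]).toarray())    # transform the test set
print(ohe.get_feature_names(ohe_cols))              # encoded column names (scikit-learn 0.22)
```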
3. Create one Ordinal encoder for all Ordinal features. The documentation of Scikit-learn is not clear on how to perform this on many columns at the same time. Here's a **hint**:
If the ordinal data set is (a subset of two columns, but I keep all rows for this example):
| | menopause | deg-malig |
|---:|:------------|------------:|
| 3 | premeno | 3 |
| 4 | premeno | 2 |
The first step is to create a dictionary:

```python
dict_ = {0: ['lt40', 'premeno', 'ge40'], 1: [1, 2, 3]}
```
Then to instantiate an `OrdinalEncoder` (scikit-learn expects the ordered categories as a list with one entry per column, so the dictionary values are unpacked):

```python
oe = OrdinalEncoder(categories=list(dict_.values()))
```
Now that you have enough information:
- Fit on the train set
- Transform the test set
4. Use a `make_column_transformer` to combine the two Encoders (a sketch follows the note below).

- Fit on the train set
- Transform the test set

*Hint: Check the first resource*
**Note: Version 0.22 of Scikit-learn can't handle `get_feature_names` on `OrdinalEncoder`. If the column transformer contains an `OrdinalEncoder`, the method returns this error:**

```console
AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provide get_feature_names.
```
**It means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the column names in the right order. This step is not required in this exercise.**
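A minimal sketch, assuming the `ohe`, `oe` and `ohe_cols` from the previous questions, plus a hypothetical `ordinal_cols` list:

```python
from sklearn.compose import make_column_transformer

# hypothetical list of ordinal columns, matching the dictionary used for the OrdinalEncoder
ordinal_cols = ['menopause', 'deg-malig']

ct = make_column_transformer(
    (ohe, ohe_cols),       # One Hot Encoder on the nominal columns
    (oe, ordinal_cols),    # Ordinal Encoder on the ordinal columns
)

ct.fit(X_train)            # fit both encoders on the train set
print(ct.transform(X_test)[:2])
```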
Resources:

- https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79
- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
## Correction

1. This question is validated if the number of unique values per feature is:

```console
age 3
menopause 11
tumor-size 7
irradiat 2
dtype: int64
```
2. This question is validated if the test set transformed by the `OneHotEncoder` fitted on the train set is:

```console
First 10 rows:
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
       ...
       [1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1.]])
```
3. This question is validated if the test set transformed by the `OrdinalEncoder` fitted on the train set is:

```console
First 10 rows:
array([[2., 2., 0., 1.],
       ...
       [2., 2., 0., 0.],
       [2., 5., 0., 2.],
       [1., 3., 0., 0.]])
```
4. This question is validated if the column transformer fitted on X_train transforms X_test as:

```console
# First 2 rows:
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 2., 2., 0.,
        1.],
       [1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 2., 2., 0.,
        0.]])
```
# Exercise 6 Pipeline

The goal of this exercise is to learn to use the Scikit-learn object: Pipeline. The data set used for this exercise is the `iris` data set.
Preliminary:
- Run the code below.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris['data'], iris['target']
X[[4,15], 2] = np.nan
X[[40,135], 3] = np.nan
```
- Split the data set in a train set and test set (33%), fit the Pipeline on the train set and predict on the test set. Use `random_state=43`.

The pipeline you will implement has to contain 3 steps:

- Imputer (median)
- Standard Scaler
- LogisticRegression

1. Train the pipeline on the train set and predict on the test set. Give the score of the model on the test set (a sketch follows below).
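A minimal sketch, assuming the `X` and `y` from the preliminary code:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=43)

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

pipe.fit(X_train, y_train)           # fits the transforms and the model on the train set
print(pipe.predict(X_test))          # applies the fitted transforms, then predicts
print(pipe.score(X_test, y_test))    # accuracy on the test set
```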
## Correction

1. This question is validated if the predictions on the test set are:
```console
array([0, 0, 2, 1, 2, 0, 2, 1, 1, 1, 0, 1, 2, 0, 1, 1, 0, 0, 2, 2, 0, 0,
       0, 2, 2, 2, 0, 1, 0, 0, 1, 0, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2,
       0, 1, 1, 1, 1, 1])
```
and the score on the test set is **98%**.

**Note: Keep in mind that a 98% accuracy is not common on real-life data. Every time you have a score > 97%, check that there's no leakage in the data. On financial data sets, the signal-to-noise ratio is low: trying to forecast stock prices is a difficult problem, and an accuracy higher than 70% should be interpreted as a warning to check for data leakage!**
