# Exercise 5: Categorical variables

The goal of this exercise is to learn how to deal with categorical variables using the Ordinal Encoder, the Label Encoder and the One Hot Encoder. For this exercise I strongly suggest using a recent version of scikit-learn (`sklearn >= 0.24.1`) to avoid issues with the Ordinal Encoder.

Preliminary:

- Load the breast-cancer.csv file
- Drop the `Class` column
- Drop NaN values
- Split the data into a train set and a test set (test set size = 20% of the total size) with `random_state=43`.
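
The preliminary steps above could be sketched as follows. This is only an illustration under stated assumptions: the rows in `csv_text` are hypothetical stand-ins for `breast-cancer.csv`, and the file is assumed to have a header row with these column names. With the real file, `pd.read_csv("breast-cancer.csv")` would replace the in-memory read.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical rows standing in for breast-cancer.csv (assumed header row).
csv_text = """age,menopause,tumor-size,inv-nodes,node-caps,deg-malig,breast,breast-quad,irradiat,Class
40-49,premeno,15-19,0-2,yes,3,right,left_up,no,recurrence-events
50-59,ge40,15-19,0-2,no,1,right,central,no,no-recurrence-events
50-59,ge40,35-39,0-2,no,2,left,left_low,no,recurrence-events
40-49,premeno,35-39,0-2,yes,3,right,left_low,yes,no-recurrence-events
40-49,premeno,30-34,3-5,yes,2,left,right_up,no,recurrence-events
50-59,premeno,25-29,3-5,no,2,right,left_up,yes,no-recurrence-events
50-59,ge40,40-44,0-2,no,3,left,left_up,no,no-recurrence-events
40-49,premeno,10-14,0-2,no,2,left,left_up,no,no-recurrence-events
60-69,ge40,15-19,0-2,,2,left,left_low,no,no-recurrence-events
30-39,premeno,25-29,6-8,yes,2,right,left_up,yes,no-recurrence-events
60-69,lt40,10-14,0-2,no,1,left,right_up,no,no-recurrence-events
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop the Class column, then every row that still contains a NaN.
df = df.drop(columns="Class").dropna()

# 80/20 split with the random_state required by the exercise.
X_train, X_test = train_test_split(df, test_size=0.2, random_state=43)

print(X_train.shape, X_test.shape)
```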

1. Count the number of unique values per feature in the train set.
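
One possible way to count unique values is pandas' `DataFrame.nunique`. The small `X_train` below is a hypothetical stand-in for the real train set, used only to keep the sketch runnable.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real X_train.
X_train = pd.DataFrame(
    {
        "menopause": ["premeno", "ge40", "ge40", "lt40"],
        "deg-malig": [3, 1, 2, 3],
        "breast": ["left", "right", "left", "left"],
    }
)

# One count per column: the number of distinct values in that column.
unique_counts = X_train.nunique()
print(unique_counts)
```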

2. Identify the ordinal variables, the nominal variables and the target. Compute a One Hot Encoder transformation on the test set for all nominal (non-ordinal) categorical features, in the following order `['node-caps', 'breast', 'breast-quad', 'irradiat']`. Here are the assumptions made on the variables:

```console
age: Ordinal
['90-99' > '80-89' > '70-79' > '60-69' > '50-59' > '40-49' > '30-39' > '20-29' > '10-19']

menopause: Ordinal
['ge40'> 'premeno' >'lt40']

tumor-size: Ordinal
['55-59' > '50-54' > '45-49' > '40-44' > '35-39' > '30-34' > '25-29' > '20-24' > '15-19' > '10-14' > '5-9' > '0-4']

inv-nodes: Ordinal 
['36-39' > '33-35' > '30-32' > '27-29' > '24-26' > '21-23' > '18-20' > '15-17' > '12-14' > '9-11' > '6-8' > '3-5' > '0-2']

node-caps: One Hot
['yes' 'no']

deg-malig: Ordinal
[3 > 2 > 1]

breast: One Hot 
['left' 'right']

breast-quad: One Hot 
['right_low' 'left_low' 'left_up' 'central' 'right_up']


irradiat: One Hot 
['yes' 'no']
```

- Fit on the train set

- Transform the test set

Example of expected output:

```console
# One Hot encoder on: ['node-caps' , 'breast', 'breast-quad', 'irradiat']

input: ohe.transform(X_test[ohe_cols])[:10]
output:
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0.],
       [1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
       [0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1.]])

input: ohe.get_feature_names(ohe_cols)
output: 
array(['node-caps_no', 'node-caps_yes', 'breast_left', 'breast_right',
       'breast-quad_central', 'breast-quad_left_low',
       'breast-quad_left_up', 'breast-quad_right_low',
       'breast-quad_right_up', 'irradiat_no', 'irradiat_yes'],
      dtype=object)

```
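
The fit-on-train, transform-on-test pattern for the One Hot Encoder might look like the sketch below. The two-column frames are hypothetical toy data, not the exercise data, so the output does not match the expected output above. The encoder's default sparse result is converted with `.toarray()` so the sketch does not depend on the name of the sparse-output parameter, which has changed between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with only two of the four nominal columns.
ohe_cols = ["node-caps", "breast"]
X_train = pd.DataFrame(
    {"node-caps": ["no", "yes", "no", "no"], "breast": ["left", "right", "right", "left"]}
)
X_test = pd.DataFrame({"node-caps": ["yes", "no"], "breast": ["left", "left"]})

ohe = OneHotEncoder()
ohe.fit(X_train[ohe_cols])  # categories are learned from the train set only

# Each category becomes one 0/1 column, sorted alphabetically within a feature.
encoded = ohe.transform(X_test[ohe_cols]).toarray()
print(encoded)
```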

3. Create one Ordinal Encoder for all the ordinal features, in the following order `["menopause", "age", "tumor-size", "inv-nodes", "deg-malig"]`, and apply it to the test set. The Scikit-learn documentation is not clear on how to do this for several columns at the same time. Here's a **hint**:

Suppose the ordinal data set is the following (a subset of two columns, but I keep all rows for this example):

    |    | menopause     |   deg-malig |
    |---:|:--------------|------------:|
    |  0 | premeno       |           3 |
    |  1 | ge40          |           1 |
    |  2 | ge40          |           2 |
    |  3 | premeno       |           3 |
    |  4 | premeno       |           2 |

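The first step is to create a dictionary or a list holding the categories of each column in ascending order. The most recent versions of sklearn take a list of lists as input, one list per column:

```console
categories = [['lt40', 'premeno', 'ge40'], [1, 2, 3]]
```

Then instantiate an `OrdinalEncoder`, passing the list through the `categories` parameter:

```console
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=categories)
```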

Now that you have enough information:

- Fit on the train set
- Transform the test set
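
Putting the hint together, a minimal sketch on the two-column subset could look like this. The train and test frames are hypothetical toy data; the category order follows the assumptions listed earlier (`lt40 < premeno < ge40`, `1 < 2 < 3`).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ordinal_cols = ["menopause", "deg-malig"]

# One list per column, in the same order as ordinal_cols, lowest value first.
categories = [["lt40", "premeno", "ge40"], [1, 2, 3]]

# Hypothetical toy data standing in for the real train and test sets.
X_train = pd.DataFrame(
    {"menopause": ["premeno", "ge40", "lt40", "premeno"], "deg-malig": [3, 1, 2, 3]}
)
X_test = pd.DataFrame({"menopause": ["ge40", "lt40"], "deg-malig": [2, 1]})

oe = OrdinalEncoder(categories=categories)
oe.fit(X_train[ordinal_cols])

# Each value is replaced by its position in the matching category list.
encoded = oe.transform(X_test[ordinal_cols])
print(encoded)
```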

4. Use a `make_column_transformer` to combine the two Encoders.

- Fit on the train set
- Transform the test set

*Hint: Check the first resource*
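
As a rough illustration of combining the two encoders, the sketch below uses hypothetical toy data with two ordinal columns and one nominal column. Passing `sparse_threshold=0` is a choice made here so that the combined result is always a dense array; it is not required by the exercise.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

ordinal_cols = ["menopause", "deg-malig"]
ohe_cols = ["breast"]

# Hypothetical toy data standing in for the real train and test sets.
X_train = pd.DataFrame(
    {
        "menopause": ["premeno", "ge40", "lt40", "premeno"],
        "deg-malig": [3, 1, 2, 3],
        "breast": ["left", "right", "right", "left"],
    }
)
X_test = pd.DataFrame(
    {"menopause": ["ge40", "lt40"], "deg-malig": [2, 1], "breast": ["right", "left"]}
)

# Each (transformer, columns) pair is applied to its own columns; the
# outputs are concatenated in the order the pairs are given.
ct = make_column_transformer(
    (OrdinalEncoder(categories=[["lt40", "premeno", "ge40"], [1, 2, 3]]), ordinal_cols),
    (OneHotEncoder(), ohe_cols),
    sparse_threshold=0,  # always return a dense array
)
ct.fit(X_train)
combined = ct.transform(X_test)
print(combined)
```

The first two output columns are the ordinal codes and the last two are the one-hot columns for `breast`.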

**Note: Version 0.22 of Scikit-learn can't handle `get_feature_names` on `OrdinalEncoder`. If the column transformer contains an `OrdinalEncoder`, the method raises this error**:

```console
AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provide get_feature_names.
```

**This means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the column names in the right order. This step is not required in this exercise.**
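
One way to build such a variable by hand, assuming the ordinal columns come first in the column transformer, is to read the fitted `categories_` attribute of the One Hot Encoder. The data and the `feature_names` variable below are hypothetical and only illustrate the idea.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ordinal_cols = ["menopause", "deg-malig"]
ohe_cols = ["node-caps", "breast"]

# Hypothetical toy train set restricted to the nominal columns.
X_train = pd.DataFrame(
    {"node-caps": ["no", "yes", "no", "no"], "breast": ["left", "right", "right", "left"]}
)
ohe = OneHotEncoder().fit(X_train[ohe_cols])

# One name per one-hot column: "<feature>_<category>", in encoder order.
ohe_names = [
    f"{col}_{cat}" for col, cats in zip(ohe_cols, ohe.categories_) for cat in cats
]

# Ordinal columns keep their names; the order must match the transformer order.
feature_names = ordinal_cols + ohe_names
print(feature_names)
```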

Resources:

- https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79

- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/