The goal of this exercise is to learn how to deal with Categorical variables with Ordinal Encoder, Label Encoder and One Hot Encoder.
The goal of this exercise is to learn how to deal with Categorical variables with Ordinal Encoder, Label Encoder and One Hot Encoder. For this exercice I strongly suggest to use a recent version of `sklearn >= 0.24.1` to avoid issues with the Ordinal Encoder.
Preliminary:
@ -11,32 +11,33 @@ Preliminary:
1. Count the number of unique values per feature in the train set.
2. Identify the variables ordinal variables, nominal variables and the target. Create one One Hot Encoder for all categorical features (no ordinal). Here are the assumptions made on the variables:
2. Identify the variables ordinal variables, nominal variables and the target. Compute a One Hot Encoder transformation on the test set for all categorical features (no ordinal) in the following order `['node-caps' , 'breast', 'breast-quad', 'irradiat']`. Here are the assumptions made on the variables:
3. Create one Ordinal encoder for all Ordinal features. The documentation of Scikit-learn is not clear on how to perform this on many columns at the same time. Here's a **hint**:
3. Create one Ordinal encoder for all Ordinal features in the following order `["menopause", "age", "tumor-size","inv-nodes", "deg-malig"]` on the test set. The documentation of Scikit-learn is not clear on how to perform this on many columns at the same time. Here's a **hint**:
If the ordinal data set is (subset of two columns but I keep all rows for this example):
| | age | node-caps |
|---:|:--------|------------:|
| 0 | premeno | 3 |
| 1 | ge40 | 1 |
| 2 | ge40 | 2 |
| 3 | premeno | 3 |
| 4 | premeno | 2 |
| | menopause | deg-malig |
|---:|:--------------|------------:|
| 0 | premeno | 3 |
| 1 | ge40 | 1 |
| 2 | ge40 | 2 |
| 3 | premeno | 3 |
| 4 | premeno | 2 |
The first step is to create a dictionnary:
The first step is to create a dictionnary or a list - the most recent version of sklearn take as input lists: