3.6 KiB

Raw Blame History

Exercise 5 Categorical variables

The goal of this exercise is to learn how to deal with Categorical variables with Ordinal Encoder, Label Encoder and One Hot Encoder.

Preliminary:

Load the breast-cancer.csv file
Drop Class column
Drop NaN values
Split the data in a train set and test set (test set size = 20% of the total size) with random_state=43.

Count the number of unique values per feature in the train set.
Identify the variables ordinal variables, nominal variables and the target. Create one One Hot Encoder for all categorical features (no ordinal). Here are the assumptions made on the variables:

age: Ordinal
['ge40'> 'premeno' >'lt40']

menopause: Ordinal
['50-54' > '45-49' > '40-44' >  '35-39' > '30-34' > '25-29'> '20-24' > '15-19' > '10-14' > '5-9' > '0-4']

tumor-size: Ordinal
['15-17' >  '12-14' > '9-11' > '6-8' > '3-5' > '0-2']

inv-nodes: One Hot 
['yes' 'no']

node-caps: Ordinal
[3 > 2 > 1]

deg-malig: One Hot 
['left' 'right']

breast: One Hot 
['right_low' 'left_low' 'left_up' 'central' 'right_up']

breast-quad: One Hot 
['yes' 'no']

irradiat: One Hot 
['recurrence-events' 'no-recurrence-events']

Fit on the train set
Transform the test set

Example of expected output:

# One Hot encoder on: ['inv-nodes', 'deg-malig', 'breast', 'breast-quad', 'irradiat']

input: ohe.transform(df[ohe_cols]) 
output:
array([[0., 1., 0., ..., 0., 0., 1.],
    [1., 0., 0., ..., 0., 1., 0.],
    [1., 0., 1., ..., 0., 0., 1.],
    ...,
    [0., 1., 0., ..., 0., 1., 0.],
    [1., 0., 0., ..., 0., 1., 0.],
    [1., 0., 1., ..., 0., 1., 0.]])

input: ohe.get_feature_names(ohe_cols)
output: 
array(['inv-nodes_no', 'inv-nodes_yes', 'deg-malig_left',
       'deg-malig_right', 'breast_central', 'breast_left_low',
       'breast_left_up', 'breast_right_low', 'breast_right_up',
       'breast-quad_no', 'breast-quad_yes',
       'irradiat_no-recurrence-events', 'irradiat_recurrence-events'],
      dtype=object)

Create one Ordinal encoder for all Ordinal features. The documentation of Scikit-learn is not clear on how to perform this on many columns at the same time. Here's a hint:

If the ordinal data set is (subset of two columns but I keep all rows for this example):

|    | age     |   node-caps |
|---:|:--------|------------:|
|  0 | premeno |           3 |
|  1 | ge40    |           1 |
|  2 | ge40    |           2 |
|  3 | premeno |           3 |
|  4 | premeno |           2 |

The first step is to create a dictionnary:

dict_ = {0: ['lt40', 'premeno' , 'ge40'], 1:[1,2,3]}

Then to instantiate an OrdinalEncoder:

oe = OrdinalEncoder(dict_)

Now that you have enough information:

Fit on the train set
Transform the test set

Use a make_column_transformer to combine the two Encoders.

Fit on the train set
Transform the test set

Hint: Check the first ressource

Note: The version 0.22 of Scikit-learn can't handle get_feature_names on OrdinalEncoder. If the column transformer contains an OrdinalEncoder, the method returns this error:

AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provide get_feature_names.

It means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the columns name in the right order. This step is not required in that exercise

Ressources:

3.6 KiB Raw Blame History

Exercise 5 Categorical variables

3.6 KiB

Raw Blame History