Forest Cover Type Prediction - Correction

The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features, and train a machine learning model on the cartographic data that is as accurate as possible.

Problem

The expected structure of the project is:

project
│   README.md
│   environment.yml
│
├───data
│   │   train.csv
│   │   test.csv (not available first day)
│   │   covtype.info
│
├───notebook
│   │   EDA.ipynb
│
├───scripts
│   │   preprocessing_feature_engineering.py
│   │   model_selection.py
│   │   predict.py
│
└───results
    │   confusion_matrix_heatmap.png
    │   learning_curve_best_model.png
    │   test_predictions.csv
    │   best_model.pkl

The README file contains a description of the project and explains how to run the code from an empty environment. It also summarizes the implementation of each Python file. The preprocessing, which is a key part, should be described precisely. Finally, the README should contain a conclusion that gives the performance of the strategy.
The environment file has to list every library needed to run the code, along with its version.
The notebook is not evaluated.

1. Preprocessing and feature engineering

2. Model selection and prediction

Data splitting

The data splitting structure is:

DATA
├─── TRAIN FILE (0)
│    ├───── Train (1):
│    │         Fold0:
│    │             Train
│    │             Validation
│    │         Fold1:
│    │             Train
│    │             Validation
│    │         ...
│    │
│    └───── Test (1)
│
└─── TEST FILE (0) (available last day)

The train set (0) is divided into a train set (1) and a test set (1). The split ratio (the share held out as test set (1)) is less than 33%.
The cross-validation splits the train set (1) into at least 5 folds. A stratified cross-validation is a plus, but it is not a requirement.
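
The splitting scheme above can be sketched as follows. This is only an illustration under stated assumptions: the data is a synthetic stand-in generated with `make_classification` rather than the real train.csv, and the 20% hold-out and the random seeds are arbitrary choices that satisfy the "less than 33%" rule.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the train file (0); the real project would load train.csv.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# Train file (0) -> train set (1) and test set (1).
# test_size=0.2 is an arbitrary value below the 33% limit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# At least 5 folds on the train set (1). Stratification is optional,
# but it keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(cv.split(X_train, y_train))

for i, (train_idx, val_idx) in enumerate(folds):
    print(f"Fold{i}: train={len(train_idx)} validation={len(val_idx)}")
```

When auditing, the two things to look for are the hold-out fraction passed to the split and the number of folds passed to the cross-validation object.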

Gridsearch

It contains at least these 5 different models:
- Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression.

Several setups are acceptable:

- 5 grid searches, each on 1 model
- 1 grid search covering the 5 models
- 1 grid search on a pipeline that contains the preprocessing
- 5 grid searches on a pipeline that contains the preprocessing
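
As an illustration of the "1 grid search on a pipeline that contains the preprocessing" option, here is a minimal sketch. The `StandardScaler` step, the step names `scaler` and `clf`, the tiny hyperparameter grids, and the synthetic data are all assumptions made for the example; a real submission will have its own preprocessing and wider grids.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the train set (1).
X_train, y_train = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# The "clf" step is a placeholder that the grid swaps for each candidate model.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

# One sub-grid per required model; the values are deliberately small.
param_grid = [
    {"clf": [GradientBoostingClassifier(random_state=0)], "clf__n_estimators": [20, 40]},
    {"clf": [KNeighborsClassifier()], "clf__n_neighbors": [3, 7]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [20, 40]},
    {"clf": [SVC()], "clf__C": [0.1, 1.0]},
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0]},
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

best_name = type(search.best_estimator_.named_steps["clf"]).__name__
print(best_name, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means the scaler is refit on the training part of each fold, so the validation part never influences the preprocessing.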

Training

Check that the target is removed from the X matrix
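
One way to perform this check is sketched below. The column name `Cover_Type` and the feature columns `Elevation` and `Slope` are assumptions used for illustration; substitute whatever names the submitted code actually uses.

```python
import pandas as pd

TARGET = "Cover_Type"  # assumption: name of the target column in train.csv

# Tiny stand-in frame with illustrative feature names and values.
df = pd.DataFrame(
    {"Elevation": [2596, 2590, 2804], "Slope": [3, 2, 9], TARGET: [5, 5, 2]}
)

# Separate the features from the target before any fitting.
X = df.drop(columns=[TARGET])
y = df[TARGET]

# The audit check: the target must not appear among the features.
assert TARGET not in X.columns, "target leaked into the feature matrix"
print(list(X.columns))
```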

Results

Run predict.py on the test set and check that:

Test (last day) accuracy > 0.65.

Then, check:

Train accuracy score < 0.98. It can be checked on the learning curve. If you are not sure, load the model, load the training set (0), and score the model on the training set (0).
The confusion matrix is represented as a DataFrame. Example:
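
A confusion matrix in DataFrame form might be built like this. The labels and predictions are made-up values for illustration, and the `true_`/`pred_` naming of rows and columns is just one possible convention.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for three classes.
y_true = [1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 2, 2, 2, 3, 3, 1]
labels = [1, 2, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{label}" for label in labels],
    columns=[f"pred_{label}" for label in labels],
)
print(cm_df)
```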

The learning curve for the best model is plotted. Example:
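
A plot of this kind could be produced roughly as follows. This is a sketch under assumptions: a small random forest stands in for the best model, the data is synthetic, and the colors are a free choice (green is used here for the validation curve). The figure is written to the current directory; the project itself expects it under results/.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the train set (1) and for the selected model.
X_train, y_train = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
best_model = RandomForestClassifier(n_estimators=30, random_state=0)

# Cross-validated scores at increasing training-set sizes.
sizes, train_scores, val_scores = learning_curve(
    best_model, X_train, y_train, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.2, 1.0, 5),
)

fig, ax = plt.subplots()
ax.plot(sizes, train_scores.mean(axis=1), "o-", color="red", label="Training accuracy")
ax.plot(sizes, val_scores.mean(axis=1), "o-", color="green", label="Validation accuracy")
ax.set_xlabel("Training examples")
ax.set_ylabel("Accuracy")
ax.legend()
fig.savefig("learning_curve_best_model.png")
```

Both curves come from cross-validation on the train set (1), so neither of them uses the test set (1) or the test file (0).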

Note: The green line on the plot shows the accuracy on the validation set, not on the test set (1) or the test set (0).

The trained model is saved as a pickle file.
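
To verify the saved model, it can be reloaded and scored, as in the sketch below. The logistic regression, the synthetic data, and the temporary directory are stand-ins chosen for the example; in the project the file would be results/best_model.pkl and the data would be the training set (0).

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training set (0) and for the best model.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted model (a temporary directory is used here for the demo).
model_path = Path(tempfile.mkdtemp()) / "best_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload it and score on the training data, as the train-accuracy check suggests.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

train_accuracy = loaded.score(X, y)
print(f"train accuracy: {train_accuracy:.3f}")
```

A pickle written with one scikit-learn version may fail to load with another, which is one more reason for the environment file to pin library versions.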
