You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

3.3 KiB

Forest Cover Type Prediction - Correction

The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features and to train a machine learning model on the cartographic data to make it as accurate as possible.

Problem

The expected structure of the project is:

project
│   README.md
│   environment.yml    
│
└───data
│   │   train.csv
│   |   test.csv (not available first day)
|   |   covtype.info
│   
└───notebook
│   │   EDA.ipynb
|
|───scripts
|   │   preprocessing_feature_engineering.py
|   │   model_selection.py
│   |   predict.py
│   
└───results
    │   confusion_matrix_heatmap.png
    │   learning_curve_best_model.png
    │   test_predictions.csv
    │   best_model.pkl

  • The readme file contains a description of the project and explains how to run the code from an empty environment. It also gives a summary of the implementation of each python file. The preprocessing which is a key part should be decribed precisely. Finally, it should contain a conclusion that gives the performance of the strategy.

  • The environment has to contain all libraries used and their versions that are necessary to run the code.

  • The notebook is not evaluated.

1. Preprocessing and features engineering:

2. Model selection and predict

Data splitting

The data splitting structure is:

DATA
└───TRAIN FILE (0)
│   └───── Train (1):
│   |           Fold0:
|   |                  Train
|   |                  Validation
|   |           Fold1:
|   |                   Train
|   |                   Validation
... ...         ...
|   |
|   └───── Test (1)
│   
└─── TEST FILE (0)(available last day)

  • The train set (0) id divised in a train set (1) and test set (1). The ratio is less than 33%.
  • The cross validation splits the train set (1) is at least 5 folds. If the cross validation is stratified that's a good point but it is not a requirement.

Gridsearch

  • It contains at least these 5 different models:
    • Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression.

There are many options:

  • 5 grid searches on 1 model
  • 1 grid search on 5 models
  • 1 grid search on a pipeline that contains the preprocessing
  • 5 grid searches on a pipeline that contains the preprocessing

Training

  • Check that the target is removed from the X matrix

Results

Run predict.py on the test set, check that:

  • Test (last day) accuracy > 0.65.

Then, check:

  • Train accuracy score < 0.98. It can be checked on the learning curve. If you are not sure, load the model, load the training set (0), score on the training set (0).
  • The confusion matrix is represented as a DataFrame. Example: alt text
  • The learning curve for the best model is plotted. Example:

alt text

Note: The green line on the plot shows the accuracy on the validation set not on the test set (1) and not on the test set (0).

  • The trained model is saved as a pickle file