
raid02: review and test

lee, 3 years ago · commit 1a53754009

one_md_per_day_format/piscine/Week2/weekend.md
## Forest Cover Type Prediction

The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features, and train a machine learning model on the cartographic data to make it as accurate as possible.

## Data

The input files are `train.csv`, `test.csv` and `covtype.info`:

- `train.csv`
- `test.csv`
- `covtype.info`

The train data set is used to **analyse the data and calibrate the models**. The goal is to get the accuracy as high as possible on the test set. The test set will be available at the end of the last day to prevent overfitting on the test set.

The data is described in `covtype.info`.

## Structure

The structure of the project is:

```console
project
│ README.md
│ environment.yml
│   ...
│ test_predictions.csv
│ best_model.pkl
```

## 1. EDA and feature engineering

- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook will not be evaluated.
- *Hint: Examples of interesting features (a sketch of how to compute them follows this list)*
  - `Distance to hydrology = sqrt((Horizontal_Distance_To_Hydrology)^2 + (Vertical_Distance_To_Hydrology)^2)`
  - `Horizontal_Distance_To_Fire_Points - Horizontal_Distance_To_Roadways`
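
As an illustration, here is a minimal sketch of how these two features could be computed with pandas (the file path and column names are assumptions based on `covtype.info` and the usual naming of this data set):

```python
import numpy as np
import pandas as pd

# Load the train set (path assumed from the project structure above)
df = pd.read_csv("train.csv")

# Euclidean distance to hydrology, combining the horizontal and vertical components
df["Distance_To_Hydrology"] = np.sqrt(
    df["Horizontal_Distance_To_Hydrology"] ** 2
    + df["Vertical_Distance_To_Hydrology"] ** 2
)

# Gap between the distance to fire points and the distance to roadways
df["Fire_Minus_Road_Distance"] = (
    df["Horizontal_Distance_To_Fire_Points"] - df["Horizontal_Distance_To_Roadways"]
)
```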

## 2. Model Selection

The model selection approach is a key step because it should return the best model and guarantee that the results are reproducible on the final test set. The goal of this step is to make sure that the results on the test set are not due to test set overfitting. It implies splitting the data set as shown below:

```console
DATA
└───TRAIN FILE (0)
│ └───── Train (1)
│   ...
└─── TEST FILE (0) (available last day)
```

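As a sketch, this split could be reproduced with scikit-learn as follows (the target column name `Cover_Type` and the split sizes are assumptions, not part of the subject):

```python
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

df = pd.read_csv("train.csv")          # TRAIN FILE (0), path assumed
X = df.drop(columns=["Cover_Type"])    # target column name assumed; check covtype.info
y = df["Cover_Type"]

# Train (1) / Test (1) split inside the train file;
# the final TEST FILE (0) is kept untouched until the last day
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# At least 5 folds of cross validation on Train (1)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
```
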
**Rules:**
- Split train test
- Cross validation: at least 5 folds
- Grid search on at least 5 different models (see the sketch after this list):
  - Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression. *Remember that for some models scaling the data is important and for others it does not matter.*
- Train accuracy score < **0.98** on the Train set (0). Write the result in the `README.md`.
- Test (last day) accuracy > **0.65** on the Test set (0). Write the result in the `README.md`.
- Display the confusion matrix for the best model in a DataFrame. Specify the index and column names (True label and Predicted label).
- Plot the learning curve for the best model
- Save the trained model as a [pickle](https://www.datacamp.com/community/tutorials/pickle-python-tutorial) file
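
To give the flavour of these rules, here is a sketch for one of the five model families, reusing `X_train` and `y_train` from the split sketch above (the parameter grid is illustrative, not a recommendation):

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid search with 5-fold cross validation on Train (1)
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [10, 20, None],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Cross-validated accuracy:", grid.best_score_)

# Save the trained model as a pickle file (name matches the project tree)
with open("best_model.pkl", "wb") as f:
    pickle.dump(grid.best_estimator_, f)
```

For SVM, KNN and Logistic Regression, wrap the estimator in a `Pipeline` with a `StandardScaler`, since those models are sensitive to feature scale.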

> Advice: As the grid search takes time, I suggest preparing and testing the code first. Once you are confident it works, run the grid search overnight and analyse the results.

**Hint**: The confusion matrix shows the misclassifications class per class. Try to detect if the model badly misclassifies one class as another. Then, do some research on the internet about the two forest cover types, find the differences and create new features that underline these differences. More generally, the methodology of model learning is a cycle with several iterations. More details [here](https://serokell.io/blog/machine-learning-testing).
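
One possible way to build the required confusion matrix DataFrame (a sketch, reusing the fitted `grid` and the `X_test`/`y_test` split from the sketches above):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_pred = grid.best_estimator_.predict(X_test)

# Name the index and columns as required: True label and Predicted label
labels = sorted(y_test.unique())
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=labels),
    index=pd.Index(labels, name="True label"),
    columns=pd.Index(labels, name="Predicted label"),
)
print(cm)
```
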
## 3. Predict (last day)

Once you have selected the best model and you are confident it will perform well on new data, you're ready to predict on the test set (see the sketch after this list):

- Load the trained model
- Predict on the test set and compute the accuracy
- Save the predictions in a csv file
- Add your score to the `README.md`
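
A sketch of these last-day steps, under the same assumptions as above (the pickle and csv file names follow the project tree; the target column name in `test.csv` is an assumption):

```python
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score

# Load the trained model
with open("best_model.pkl", "rb") as f:
    model = pickle.load(f)

# Predict on the test set and compute the accuracy
# (any features engineered on the train set must be recomputed here as well)
test = pd.read_csv("test.csv")                # path assumed
X_final = test.drop(columns=["Cover_Type"])   # target column name assumed
y_final = test["Cover_Type"]

predictions = model.predict(X_final)
print("Test accuracy:", accuracy_score(y_final, predictions))

# Save the predictions in a csv file (name matches the project tree)
pd.DataFrame({"Cover_Type": predictions}).to_csv("test_predictions.csv", index=False)
```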
