
feat: structure week 2 and correct week1

pull/42/head
Badr Ghazlane 3 years ago
parent commit 1ac4058d3e
  1. 1
      one_exercise_per_file/week01/raid01/audit/readme.md
  2. 141
      one_exercise_per_file/week01/raid01/readme.md
  3. 699
      one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.data
  4. 126
      one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.names
  5. 286
      one_exercise_per_file/week02/day03/ex05/data/breast-cancer.csv
  6. 73
      one_exercise_per_file/week02/day03/ex05/data/breast_cancer_readme.txt
  7. 0
      one_exercise_per_file/week02/day04/audit/readme.md
  8. 1
      one_exercise_per_file/week02/day04/ex01/audit/readme.md
  9. 10
      one_exercise_per_file/week02/day04/ex01/readme.md
  10. 1
      one_exercise_per_file/week02/day04/ex02/audit/readme.md
  11. 10
      one_exercise_per_file/week02/day04/ex02/readme.md
  12. 28
      one_exercise_per_file/week02/day04/ex03/audit/readme.md
  13. 37
      one_exercise_per_file/week02/day04/ex03/readme.md
  14. 41
      one_exercise_per_file/week02/day04/ex04/audit/readme.md
  15. BIN
      one_exercise_per_file/week02/day04/ex04/images/w2_day4_ex4_q3.png
  16. 36
      one_exercise_per_file/week02/day04/ex04/readme.md
  17. 72
      one_exercise_per_file/week02/day04/ex05/audit/readme.md
  18. 55
      one_exercise_per_file/week02/day04/ex05/readme.md
  19. 31
      one_exercise_per_file/week02/day04/ex06/audit/readme.md
  20. 51
      one_exercise_per_file/week02/day04/ex06/readme.md
  21. 34
      one_exercise_per_file/week02/day04/readme.md
  22. 0
      one_exercise_per_file/week02/day05/audit/readme.md
  23. 18
      one_exercise_per_file/week02/day05/ex01/audit/readme.md
  24. 27
      one_exercise_per_file/week02/day05/ex01/readme.md
  25. 16
      one_exercise_per_file/week02/day05/ex02/audit/readme.md
  26. 53
      one_exercise_per_file/week02/day05/ex02/readme.md
  27. 33
      one_exercise_per_file/week02/day05/ex03/audit/readme.md
  28. 49
      one_exercise_per_file/week02/day05/ex03/readme.md
  29. 27
      one_exercise_per_file/week02/day05/ex04/audit/readme.md
  30. BIN
      one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q1.png
  31. BIN
      one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q2.png
  32. 58
      one_exercise_per_file/week02/day05/ex04/readme.md
  33. 26
      one_exercise_per_file/week02/day05/readme.md
  34. 111
      one_exercise_per_file/week02/raid02/audit/readme.md
  35. BIN
      one_exercise_per_file/week02/raid02/images/w2_weekend_confusion_matrix.png
  36. BIN
      one_exercise_per_file/week02/raid02/images/w2_weekend_learning_curve.png
  37. 101
      one_exercise_per_file/week02/raid02/readme.md

1
one_exercise_per_file/week01/raid01/audit/readme.md

@@ -1,3 +1,4 @@
# RAID01 - Backtesting on the SP500 - correction
```
project

141
one_exercise_per_file/week01/raid01/readme.md

@@ -1,44 +1,41 @@
# D0607 Piscine AI - Data Science
# RAID01 - Backtesting on the SP500
## SP500 data preprocessing
The goal of this project is to perform a Backtest on the SP500 constituents. The SP500 is an index of the 500 biggest capitalizations in the US.
## Data
The input files are `stock_prices.csv` and `sp500.csv`:
- `sp500.csv`: contains the SP500 data. The SP500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States.
- `stock_prices.csv`: contains the close prices for all the companies that have been in the SP500. It contains a lot of missing data. The adjusted close price may be unavailable for three main reasons:
  - The company doesn't exist at date d
  - The company is not public (not listed on an exchange)
  - Its close price hasn't been reported
- Note: The quality of this data set is not good: some prices are wrong, there are price spikes, and there are price adjustments (share split, dividend distribution) - the price adjustments are corrected in the adjusted close. But I'm not providing that data for this project, to let you understand what bad quality data is and how important it is to detect outliers and missing values. The idea is not to correct the full data set manually, but to correct the main problems.

_Note: The corrections will not fix the data; as a result, the results may be abnormal compared to results from cleaned financial data. That's not a problem for this small project!_

## Problem
Once preprocessed, this data will be used to generate a signal, that is, for each asset at each date, a metric that indicates whether the asset price will increase the next month. At each date (once a month) we will take the 20 highest metrics and invest 1$ per company. This strategy is called **stock picking**. It consists in picking stocks in an index and trying to outperform the index. Finally we will compare the performance of our strategy to the benchmark: the SP500.

It is important to understand that the SP500 components change over time. The reason is simple: Facebook entered the SP500 in 2013 and, as the index always contains 500 companies, another company had to be removed.

The structure of the project is:

```console
project
│ README.md
│ environment.yml
└───data
│ │ sp500.csv
│ | prices.csv
└───notebook
│ │ analysis.ipynb
|
@@ -48,104 +45,99 @@ project
| │ create_signal.py
| | backtester.py
│ | main.py
└───results
│ plots
│ results.txt
│ outliers.txt
```
There are four parts:
## 1. Preliminary
- Create a function that takes as input one CSV data file. This function should optimize the types to reduce its size and return a memory optimized DataFrame.
- For `float` data the smallest data type used is `np.float32`
- These steps may help you to implement the `memory_reducer` (a minimal sketch follows this list):
1. Iterate over every column
2. Determine if the column is numeric
3. Determine if the column can be represented by an integer
4. Find the min and the max value
5. Determine and apply the smallest datatype that can fit the range of values
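A minimal sketch of these steps, assuming the file fits in memory and using pandas; the one-file signature is illustrative (the subject's `memory_reducer(paths)` takes several files):

```python
import numpy as np
import pandas as pd

def memory_reducer(path):
    # illustrative one-file variant of the subject's memory_reducer
    df = pd.read_csv(path)
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            continue  # leave non-numeric columns as they are
        col_min, col_max = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(df[col]):
            # smallest integer type whose range covers [min, max]
            for dtype in (np.int8, np.int16, np.int32, np.int64):
                if np.iinfo(dtype).min <= col_min <= col_max <= np.iinfo(dtype).max:
                    df[col] = df[col].astype(dtype)
                    break
        else:
            # the subject caps float downcasting at np.float32
            df[col] = df[col].astype(np.float32)
    return df
```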
## 2. Data wrangling and preprocessing
- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:
  - Missing values analysis
  - Outliers analysis (there are a lot of outliers)
  - A plot of the average price of the companies for all variables (save the plot with the images).
  - Describe at least 5 outliers ('ticker', 'date', 'price'). Put them in an `outliers.txt` file with the 3 fields in the `results` folder.

_Note: create functions that generate the plots and save them in the images folder. Add a parameter `plot` with a default value `False` which doesn't return the plot. This will be useful for the correction, to let people run your code without overriding your plots._
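A minimal sketch of that `plot` parameter pattern (the function, column names and save path are illustrative):

```python
import matplotlib.pyplot as plt

def plot_average_price(prices, plot=False, path="images/average_price.png"):
    # compute and save the figure; only return it when plot=True
    avg = prices.groupby("date")["price"].mean()
    fig, ax = plt.subplots()
    avg.plot(ax=ax, title="Average price over all companies")
    ax.set_xlabel("date")
    ax.set_ylabel("price ($)")
    fig.savefig(path)
    if plot:
        return fig
    plt.close(fig)  # nothing is displayed or returned during the correction run
```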
- Here is how the `prices` data should be preprocessed (a sketch follows this list):
- Resample data on month and keep the last value
- Filter price outliers: remove prices outside of the range 0.1$ - 10k$
- Compute monthly returns:
  - Historical returns: **returns(current month) = (price(current month) - price(previous month)) / price(previous month)**
  - Future returns: **returns(current month) = (price(next month) - price(current month)) / price(current month)**
- Replace return outliers by the last value available for the company. This corrects price spikes that correspond to a monthly return greater than 1 or smaller than -0.5. This correction should not consider the 2008 and 2009 period, as the financial crisis impacted the market brutally. **Don't forget that a value is considered an outlier compared to the other returns/prices of the same company**
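A sketch of the resampling, filtering and return computation (the outlier-replacement step is left out), assuming `prices` has columns `date`, `ticker`, `price`; this layout is an assumption:

```python
import pandas as pd

def preprocess_prices(prices: pd.DataFrame) -> pd.DataFrame:
    # assumed layout: columns ['date', 'ticker', 'price']
    prices = prices.set_index("date").sort_index()
    # resample on month per company, keep the last value
    monthly = (prices.groupby("ticker")["price"]
                     .resample("M").last()
                     .to_frame("price"))
    # filter price outliers outside the 0.1$ - 10k$ range
    bad = (monthly["price"] < 0.1) | (monthly["price"] > 10_000)
    monthly.loc[bad, "price"] = None
    # monthly returns, computed per company
    past = monthly.groupby(level="ticker")["price"].pct_change()
    monthly["monthly_past_return"] = past
    monthly["monthly_future_return"] = past.groupby(level="ticker").shift(-1)
    return monthly
```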
At this stage the DataFrame should look like this:

| | Price | monthly_past_return | monthly_future_return |
| :--------------------------------------------------- | ------: | ------------------: | -------------------: |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'A') | 36.7304 | nan | -0.00365297 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AA') | 25.9505 | nan | 0.101194 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AAPL') | 1.00646 | nan | 0.452957 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABC') | 11.4383 | nan | -0.0528713 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABT') | 38.7945 | nan | -0.07205 |
- Fill the missing values using the last available value (same company)
- Drop the missing values that can't be filled
- Print `prices.isna().sum()`
- Here is how the `sp500.csv` data should be preprocessed:
- Resample data on month and keep the last value
- Compute historical monthly returns on the adjusted close
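Continuing the illustrative `monthly` frame from the sketch above, the fill and drop steps could look like:

```python
# forward-fill with the last available value of the same company
monthly["price"] = monthly.groupby(level="ticker")["price"].ffill()
# drop what can't be filled (e.g. a company's first observations)
monthly = monthly.dropna(subset=["price"])
print(monthly.isna().sum())
```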
## 3. Create signal
At this stage we have a data set with features that we will leverage to get an investment signal. As previously said, we will focus on one single variable to create the signal: **monthly_past_return**. The signal will be the average of the monthly returns of the previous year.

The naive assumption made here is that if a stock has performed well over the last year, it will perform well the next month. Moreover, we assume that we can buy stocks as soon as we have the signal (the signal is available at the close of day `d` and we assume that we can buy the stock at the close of day `d`. The assumption is acceptable while considering monthly returns, because the difference between the close of day `d` and the open of day `d+1` is small compared to the monthly return).

- Create a column `average_return_1y`
- Create a column named `signal` that contains `True` if `average_return_1y` is among the 20 highest `average_return_1y` values of that month.
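One possible way to build both columns, still on the illustrative `monthly` frame:

```python
# average of the previous 12 monthly returns, per company
monthly["average_return_1y"] = (
    monthly.groupby(level="ticker")["monthly_past_return"]
           .transform(lambda s: s.rolling(12).mean())
)
# True for the 20 highest averages at each date
monthly["signal"] = (
    monthly.groupby(level="date")["average_return_1y"]
           .rank(ascending=False, method="first") <= 20
)
```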
## 4. Backtester
At this stage we have an investment signal that indicates, each month, the 20 companies we should invest 1$ in (1$ each). In order to check the strategy's performance we will backtest our investment signal.
- Compute the PnL and the total return of our strategy without a for loop (a vectorized sketch follows below). Save the results in a text file `results.txt` in the folder `results`.
- Compute the PnL and the total return of the strategy that consists in investing 20$ each day on the SP500. Compare. Save the results in the text file `results.txt` in the folder `results`.
- Create a plot that shows the performance of the strategy over time for the SP500 and the Stock Picking 20 strategy.
A data point (x-axis: date, y-axis: cumulated_return) is the **cumulated returns** from the beginning of the strategy at date `t`. Save the plot in the results folder.

> This plot is used a lot in Finance because it helps to compare a custom strategy with an index. In that case we say that the SP500 is used as a **benchmark** for the Stock Picking strategy.

![alt text][performance]

[performance]: images/w1_weekend_plot_pnl.png 'Cumulative Performance'
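A vectorized sketch of the PnL series (no `for` loop), under the same assumed layout:

```python
# 1$ on each signalled stock: the monthly PnL is the sum of their future returns
picked = monthly[monthly["signal"]]
pnl = picked.groupby(level="date")["monthly_future_return"].sum()
cumulated_return = pnl.cumsum()  # one point per date for the benchmark plot
```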
## 5. Main
Here is a sketch of `main.py`.
```python
# main.py
# import data
prices, sp500 = memory_reducer(paths)
@@ -157,8 +149,7 @@ prices, sp500 = preprocessing(prices, sp500)
prices = create_signal(prices)
# backtest
backtest(prices, sp500)
```
**The command `python main.py` executes the code from data imports to the backtest and saves the results.**

699
one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.data

@@ -0,0 +1,699 @@
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4
1056784,3,1,1,1,2,1,2,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1059552,1,1,1,1,2,1,3,1,1,2
1065726,5,2,3,4,2,7,3,6,1,4
1066373,3,2,1,1,1,1,2,1,1,2
1066979,5,1,1,1,2,1,2,1,1,2
1067444,2,1,1,1,2,1,2,1,1,2
1070935,1,1,3,1,2,1,1,1,1,2
1070935,3,1,1,1,1,1,2,1,1,2
1071760,2,1,1,1,2,1,3,1,1,2
1072179,10,7,7,3,8,5,7,4,3,4
1074610,2,1,1,2,2,1,3,1,1,2
1075123,3,1,2,1,2,1,2,1,1,2
1079304,2,1,1,1,2,1,2,1,1,2
1080185,10,10,10,8,6,1,8,9,1,4
1081791,6,2,1,1,1,1,7,1,1,2
1084584,5,4,4,9,2,10,5,6,1,4
1091262,2,5,3,3,6,7,7,5,1,4
1096800,6,6,6,9,6,?,7,8,1,2
1099510,10,4,3,1,3,3,6,5,2,4
1100524,6,10,10,2,8,10,7,3,3,4
1102573,5,6,5,6,10,1,3,1,1,4
1103608,10,10,10,4,8,1,8,10,1,4
1103722,1,1,1,1,2,1,2,1,2,2
1105257,3,7,7,4,4,9,4,8,1,4
1105524,1,1,1,1,2,1,2,1,1,2
1106095,4,1,1,3,2,1,3,1,1,2
1106829,7,8,7,2,4,8,3,8,2,4
1108370,9,5,8,1,2,3,2,1,5,4
1108449,5,3,3,4,2,4,3,4,1,4
1110102,10,3,6,2,3,5,4,10,2,4
1110503,5,5,5,8,10,8,7,3,7,4
1110524,10,5,5,6,8,8,7,1,1,4
1111249,10,6,6,3,4,5,3,6,1,4
1112209,8,10,10,1,3,6,3,9,1,4
1113038,8,2,4,1,5,1,5,4,4,4
1113483,5,2,3,1,6,10,5,1,1,4
1113906,9,5,5,2,2,2,5,1,1,4
1115282,5,3,5,5,3,3,4,10,1,4
1115293,1,1,1,1,2,2,2,1,1,2
1116116,9,10,10,1,10,8,3,3,1,4
1116132,6,3,4,1,5,2,3,9,1,4
1116192,1,1,1,1,2,1,2,1,1,2
1116998,10,4,2,1,3,2,4,3,10,4
1117152,4,1,1,1,2,1,3,1,1,2
1118039,5,3,4,1,8,10,4,9,1,4
1120559,8,3,8,3,4,9,8,9,8,4
1121732,1,1,1,1,2,1,3,2,1,2
1121919,5,1,3,1,2,1,2,1,1,2
1123061,6,10,2,8,10,2,7,8,10,4
1124651,1,3,3,2,2,1,7,2,1,2
1125035,9,4,5,10,6,10,4,8,1,4
1126417,10,6,4,1,3,4,3,2,3,4
1131294,1,1,2,1,2,2,4,2,1,2
1132347,1,1,4,1,2,1,2,1,1,2
1133041,5,3,1,2,2,1,2,1,1,2
1133136,3,1,1,1,2,3,3,1,1,2
1136142,2,1,1,1,3,1,2,1,1,2
1137156,2,2,2,1,1,1,7,1,1,2
1143978,4,1,1,2,2,1,2,1,1,2
1143978,5,2,1,1,2,1,3,1,1,2
1147044,3,1,1,1,2,2,7,1,1,2
1147699,3,5,7,8,8,9,7,10,7,4
1147748,5,10,6,1,10,4,4,10,10,4
1148278,3,3,6,4,5,8,4,4,1,4
1148873,3,6,6,6,5,10,6,8,3,4
1152331,4,1,1,1,2,1,3,1,1,2
1155546,2,1,1,2,3,1,2,1,1,2
1156272,1,1,1,1,2,1,3,1,1,2
1156948,3,1,1,2,2,1,1,1,1,2
1157734,4,1,1,1,2,1,3,1,1,2
1158247,1,1,1,1,2,1,2,1,1,2
1160476,2,1,1,1,2,1,3,1,1,2
1164066,1,1,1,1,2,1,3,1,1,2
1165297,2,1,1,2,2,1,1,1,1,2
1165790,5,1,1,1,2,1,3,1,1,2
1165926,9,6,9,2,10,6,2,9,10,4
1166630,7,5,6,10,5,10,7,9,4,4
1166654,10,3,5,1,10,5,3,10,2,4
1167439,2,3,4,4,2,5,2,5,1,4
1167471,4,1,2,1,2,1,3,1,1,2
1168359,8,2,3,1,6,3,7,1,1,4
1168736,10,10,10,10,10,1,8,8,8,4
1169049,7,3,4,4,3,3,3,2,7,4
1170419,10,10,10,8,2,10,4,1,1,4
1170420,1,6,8,10,8,10,5,7,1,4
1171710,1,1,1,1,2,1,2,3,1,2
1171710,6,5,4,4,3,9,7,8,3,4
1171795,1,3,1,2,2,2,5,3,2,2
1171845,8,6,4,3,5,9,3,1,1,4
1172152,10,3,3,10,2,10,7,3,3,4
1173216,10,10,10,3,10,8,8,1,1,4
1173235,3,3,2,1,2,3,3,1,1,2
1173347,1,1,1,1,2,5,1,1,1,2
1173347,8,3,3,1,2,2,3,2,1,2
1173509,4,5,5,10,4,10,7,5,8,4
1173514,1,1,1,1,4,3,1,1,1,2
1173681,3,2,1,1,2,2,3,1,1,2
1174057,1,1,2,2,2,1,3,1,1,2
1174057,4,2,1,1,2,2,3,1,1,2
1174131,10,10,10,2,10,10,5,3,3,4
1174428,5,3,5,1,8,10,5,3,1,4
1175937,5,4,6,7,9,7,8,10,1,4
1176406,1,1,1,1,2,1,2,1,1,2
1176881,7,5,3,7,4,10,7,5,5,4
1177027,3,1,1,1,2,1,3,1,1,2
1177399,8,3,5,4,5,10,1,6,2,4
1177512,1,1,1,1,10,1,1,1,1,2
1178580,5,1,3,1,2,1,2,1,1,2
1179818,2,1,1,1,2,1,3,1,1,2
1180194,5,10,8,10,8,10,3,6,3,4
1180523,3,1,1,1,2,1,2,2,1,2
1180831,3,1,1,1,3,1,2,1,1,2
1181356,5,1,1,1,2,2,3,3,1,2
1182404,4,1,1,1,2,1,2,1,1,2
1182410,3,1,1,1,2,1,1,1,1,2
1183240,4,1,2,1,2,1,2,1,1,2
1183246,1,1,1,1,1,?,2,1,1,2
1183516,3,1,1,1,2,1,1,1,1,2
1183911,2,1,1,1,2,1,1,1,1,2
1183983,9,5,5,4,4,5,4,3,3,4
1184184,1,1,1,1,2,5,1,1,1,2
1184241,2,1,1,1,2,1,2,1,1,2
1184840,1,1,3,1,2,?,2,1,1,2
1185609,3,4,5,2,6,8,4,1,1,4
1185610,1,1,1,1,3,2,2,1,1,2
1187457,3,1,1,3,8,1,5,8,1,2
1187805,8,8,7,4,10,10,7,8,7,4
1188472,1,1,1,1,1,1,3,1,1,2
1189266,7,2,4,1,6,10,5,4,3,4
1189286,10,10,8,6,4,5,8,10,1,4
1190394,4,1,1,1,2,3,1,1,1,2
1190485,1,1,1,1,2,1,1,1,1,2
1192325,5,5,5,6,3,10,3,1,1,4
1193091,1,2,2,1,2,1,2,1,1,2
1193210,2,1,1,1,2,1,3,1,1,2
1193683,1,1,2,1,3,?,1,1,1,2
1196295,9,9,10,3,6,10,7,10,6,4
1196915,10,7,7,4,5,10,5,7,2,4
1197080,4,1,1,1,2,1,3,2,1,2
1197270,3,1,1,1,2,1,3,1,1,2
1197440,1,1,1,2,1,3,1,1,7,2
1197510,5,1,1,1,2,?,3,1,1,2
1197979,4,1,1,1,2,2,3,2,1,2
1197993,5,6,7,8,8,10,3,10,3,4
1198128,10,8,10,10,6,1,3,1,10,4
1198641,3,1,1,1,2,1,3,1,1,2
1199219,1,1,1,2,1,1,1,1,1,2
1199731,3,1,1,1,2,1,1,1,1,2
1199983,1,1,1,1,2,1,3,1,1,2
1200772,1,1,1,1,2,1,2,1,1,2
1200847,6,10,10,10,8,10,10,10,7,4
1200892,8,6,5,4,3,10,6,1,1,4
1200952,5,8,7,7,10,10,5,7,1,4
1201834,2,1,1,1,2,1,3,1,1,2
1201936,5,10,10,3,8,1,5,10,3,4
1202125,4,1,1,1,2,1,3,1,1,2
1202812,5,3,3,3,6,10,3,1,1,4
1203096,1,1,1,1,1,1,3,1,1,2
1204242,1,1,1,1,2,1,1,1,1,2
1204898,6,1,1,1,2,1,3,1,1,2
1205138,5,8,8,8,5,10,7,8,1,4
1205579,8,7,6,4,4,10,5,1,1,4
1206089,2,1,1,1,1,1,3,1,1,2
1206695,1,5,8,6,5,8,7,10,1,4
1206841,10,5,6,10,6,10,7,7,10,4
1207986,5,8,4,10,5,8,9,10,1,4
1208301,1,2,3,1,2,1,3,1,1,2
1210963,10,10,10,8,6,8,7,10,1,4
1211202,7,5,10,10,10,10,4,10,3,4
1212232,5,1,1,1,2,1,2,1,1,2
1212251,1,1,1,1,2,1,3,1,1,2
1212422,3,1,1,1,2,1,3,1,1,2
1212422,4,1,1,1,2,1,3,1,1,2
1213375,8,4,4,5,4,7,7,8,2,2
1213383,5,1,1,4,2,1,3,1,1,2
1214092,1,1,1,1,2,1,1,1,1,2
1214556,3,1,1,1,2,1,2,1,1,2
1214966,9,7,7,5,5,10,7,8,3,4
1216694,10,8,8,4,10,10,8,1,1,4
1216947,1,1,1,1,2,1,3,1,1,2
1217051,5,1,1,1,2,1,3,1,1,2
1217264,1,1,1,1,2,1,3,1,1,2
1218105,5,10,10,9,6,10,7,10,5,4
1218741,10,10,9,3,7,5,3,5,1,4
1218860,1,1,1,1,1,1,3,1,1,2
1218860,1,1,1,1,1,1,3,1,1,2
1219406,5,1,1,1,1,1,3,1,1,2
1219525,8,10,10,10,5,10,8,10,6,4
1219859,8,10,8,8,4,8,7,7,1,4
1220330,1,1,1,1,2,1,3,1,1,2
1221863,10,10,10,10,7,10,7,10,4,4
1222047,10,10,10,10,3,10,10,6,1,4
1222936,8,7,8,7,5,5,5,10,2,4
1223282,1,1,1,1,2,1,2,1,1,2
1223426,1,1,1,1,2,1,3,1,1,2
1223793,6,10,7,7,6,4,8,10,2,4
1223967,6,1,3,1,2,1,3,1,1,2
1224329,1,1,1,2,2,1,3,1,1,2
1225799,10,6,4,3,10,10,9,10,1,4
1226012,4,1,1,3,1,5,2,1,1,4
1226612,7,5,6,3,3,8,7,4,1,4
1227210,10,5,5,6,3,10,7,9,2,4
1227244,1,1,1,1,2,1,2,1,1,2
1227481,10,5,7,4,4,10,8,9,1,4
1228152,8,9,9,5,3,5,7,7,1,4
1228311,1,1,1,1,1,1,3,1,1,2
1230175,10,10,10,3,10,10,9,10,1,4
1230688,7,4,7,4,3,7,7,6,1,4
1231387,6,8,7,5,6,8,8,9,2,4
1231706,8,4,6,3,3,1,4,3,1,2
1232225,10,4,5,5,5,10,4,1,1,4
1236043,3,3,2,1,3,1,3,6,1,2
1241232,3,1,4,1,2,?,3,1,1,2
1241559,10,8,8,2,8,10,4,8,10,4
1241679,9,8,8,5,6,2,4,10,4,4
1242364,8,10,10,8,6,9,3,10,10,4
1243256,10,4,3,2,3,10,5,3,2,4
1270479,5,1,3,3,2,2,2,3,1,2
1276091,3,1,1,3,1,1,3,1,1,2
1277018,2,1,1,1,2,1,3,1,1,2
128059,1,1,1,1,2,5,5,1,1,2
1285531,1,1,1,1,2,1,3,1,1,2
1287775,5,1,1,2,2,2,3,1,1,2
144888,8,10,10,8,5,10,7,8,1,4
145447,8,4,4,1,2,9,3,3,1,4
167528,4,1,1,1,2,1,3,6,1,2
169356,3,1,1,1,2,?,3,1,1,2
183913,1,2,2,1,2,1,1,1,1,2
191250,10,4,4,10,2,10,5,3,3,4
1017023,6,3,3,5,3,10,3,5,3,2
1100524,6,10,10,2,8,10,7,3,3,4
1116116,9,10,10,1,10,8,3,3,1,4
1168736,5,6,6,2,4,10,3,6,1,4
1182404,3,1,1,1,2,1,1,1,1,2
1182404,3,1,1,1,2,1,2,1,1,2
1198641,3,1,1,1,2,1,3,1,1,2
242970,5,7,7,1,5,8,3,4,1,2
255644,10,5,8,10,3,10,5,1,3,4
263538,5,10,10,6,10,10,10,6,5,4
274137,8,8,9,4,5,10,7,8,1,4
303213,10,4,4,10,6,10,5,5,1,4
314428,7,9,4,10,10,3,5,3,3,4
1182404,5,1,4,1,2,1,3,2,1,2
1198641,10,10,6,3,3,10,4,3,2,4
320675,3,3,5,2,3,10,7,1,1,4
324427,10,8,8,2,3,4,8,7,8,4
385103,1,1,1,1,2,1,3,1,1,2
390840,8,4,7,1,3,10,3,9,2,4
411453,5,1,1,1,2,1,3,1,1,2
320675,3,3,5,2,3,10,7,1,1,4
428903,7,2,4,1,3,4,3,3,1,4
431495,3,1,1,1,2,1,3,2,1,2
432809,3,1,3,1,2,?,2,1,1,2
434518,3,1,1,1,2,1,2,1,1,2
452264,1,1,1,1,2,1,2,1,1,2
456282,1,1,1,1,2,1,3,1,1,2
476903,10,5,7,3,3,7,3,3,8,4
486283,3,1,1,1,2,1,3,1,1,2
486662,2,1,1,2,2,1,3,1,1,2
488173,1,4,3,10,4,10,5,6,1,4
492268,10,4,6,1,2,10,5,3,1,4
508234,7,4,5,10,2,10,3,8,2,4
527363,8,10,10,10,8,10,10,7,3,4
529329,10,10,10,10,10,10,4,10,10,4
535331,3,1,1,1,3,1,2,1,1,2
543558,6,1,3,1,4,5,5,10,1,4
555977,5,6,6,8,6,10,4,10,4,4
560680,1,1,1,1,2,1,1,1,1,2
561477,1,1,1,1,2,1,3,1,1,2
563649,8,8,8,1,2,?,6,10,1,4
601265,10,4,4,6,2,10,2,3,1,4
606140,1,1,1,1,2,?,2,1,1,2
606722,5,5,7,8,6,10,7,4,1,4
616240,5,3,4,3,4,5,4,7,1,2
61634,5,4,3,1,2,?,2,3,1,2
625201,8,2,1,1,5,1,1,1,1,2
63375,9,1,2,6,4,10,7,7,2,4
635844,8,4,10,5,4,4,7,10,1,4
636130,1,1,1,1,2,1,3,1,1,2
640744,10,10,10,7,9,10,7,10,10,4
646904,1,1,1,1,2,1,3,1,1,2
653777,8,3,4,9,3,10,3,3,1,4
659642,10,8,4,4,4,10,3,10,4,4
666090,1,1,1,1,2,1,3,1,1,2
666942,1,1,1,1,2,1,3,1,1,2
667204,7,8,7,6,4,3,8,8,4,4
673637,3,1,1,1,2,5,5,1,1,2
684955,2,1,1,1,3,1,2,1,1,2
688033,1,1,1,1,2,1,1,1,1,2
691628,8,6,4,10,10,1,3,5,1,4
693702,1,1,1,1,2,1,1,1,1,2
704097,1,1,1,1,1,1,2,1,1,2
704168,4,6,5,6,7,?,4,9,1,2
706426,5,5,5,2,5,10,4,3,1,4
709287,6,8,7,8,6,8,8,9,1,4
718641,1,1,1,1,5,1,3,1,1,2
721482,4,4,4,4,6,5,7,3,1,2
730881,7,6,3,2,5,10,7,4,6,4
733639,3,1,1,1,2,?,3,1,1,2
733639,3,1,1,1,2,1,3,1,1,2
733823,5,4,6,10,2,10,4,1,1,4
740492,1,1,1,1,2,1,3,1,1,2
743348,3,2,2,1,2,1,2,3,1,2
752904,10,1,1,1,2,10,5,4,1,4
756136,1,1,1,1,2,1,2,1,1,2
760001,8,10,3,2,6,4,3,10,1,4
760239,10,4,6,4,5,10,7,1,1,4
76389,10,4,7,2,2,8,6,1,1,4
764974,5,1,1,1,2,1,3,1,2,2
770066,5,2,2,2,2,1,2,2,1,2
785208,5,4,6,6,4,10,4,3,1,4
785615,8,6,7,3,3,10,3,4,2,4
792744,1,1,1,1,2,1,1,1,1,2
797327,6,5,5,8,4,10,3,4,1,4
798429,1,1,1,1,2,1,3,1,1,2
704097,1,1,1,1,1,1,2,1,1,2
806423,8,5,5,5,2,10,4,3,1,4
809912,10,3,3,1,2,10,7,6,1,4
810104,1,1,1,1,2,1,3,1,1,2
814265,2,1,1,1,2,1,1,1,1,2
814911,1,1,1,1,2,1,1,1,1,2
822829,7,6,4,8,10,10,9,5,3,4
826923,1,1,1,1,2,1,1,1,1,2
830690,5,2,2,2,3,1,1,3,1,2
831268,1,1,1,1,1,1,1,3,1,2
832226,3,4,4,10,5,1,3,3,1,4
832567,4,2,3,5,3,8,7,6,1,4
836433,5,1,1,3,2,1,1,1,1,2
837082,2,1,1,1,2,1,3,1,1,2
846832,3,4,5,3,7,3,4,6,1,2
850831,2,7,10,10,7,10,4,9,4,4
855524,1,1,1,1,2,1,2,1,1,2
857774,4,1,1,1,3,1,2,2,1,2
859164,5,3,3,1,3,3,3,3,3,4
859350,8,10,10,7,10,10,7,3,8,4
866325,8,10,5,3,8,4,4,10,3,4
873549,10,3,5,4,3,7,3,5,3,4
877291,6,10,10,10,10,10,8,10,10,4
877943,3,10,3,10,6,10,5,1,4,4
888169,3,2,2,1,4,3,2,1,1,2
888523,4,4,4,2,2,3,2,1,1,2
896404,2,1,1,1,2,1,3,1,1,2
897172,2,1,1,1,2,1,2,1,1,2
95719,6,10,10,10,8,10,7,10,7,4
160296,5,8,8,10,5,10,8,10,3,4
342245,1,1,3,1,2,1,1,1,1,2
428598,1,1,3,1,1,1,2,1,1,2
492561,4,3,2,1,3,1,2,1,1,2
493452,1,1,3,1,2,1,1,1,1,2
493452,4,1,2,1,2,1,2,1,1,2
521441,5,1,1,2,2,1,2,1,1,2
560680,3,1,2,1,2,1,2,1,1,2
636437,1,1,1,1,2,1,1,1,1,2
640712,1,1,1,1,2,1,2,1,1,2
654244,1,1,1,1,1,1,2,1,1,2
657753,3,1,1,4,3,1,2,2,1,2
685977,5,3,4,1,4,1,3,1,1,2
805448,1,1,1,1,2,1,1,1,1,2
846423,10,6,3,6,4,10,7,8,4,4
1002504,3,2,2,2,2,1,3,2,1,2
1022257,2,1,1,1,2,1,1,1,1,2
1026122,2,1,1,1,2,1,1,1,1,2
1071084,3,3,2,2,3,1,1,2,3,2
1080233,7,6,6,3,2,10,7,1,1,4
1114570,5,3,3,2,3,1,3,1,1,2
1114570,2,1,1,1,2,1,2,2,1,2
1116715,5,1,1,1,3,2,2,2,1,2
1131411,1,1,1,2,2,1,2,1,1,2
1151734,10,8,7,4,3,10,7,9,1,4
1156017,3,1,1,1,2,1,2,1,1,2
1158247,1,1,1,1,1,1,1,1,1,2
1158405,1,2,3,1,2,1,2,1,1,2
1168278,3,1,1,1,2,1,2,1,1,2
1176187,3,1,1,1,2,1,3,1,1,2
1196263,4,1,1,1,2,1,1,1,1,2
1196475,3,2,1,1,2,1,2,2,1,2
1206314,1,2,3,1,2,1,1,1,1,2
1211265,3,10,8,7,6,9,9,3,8,4
1213784,3,1,1,1,2,1,1,1,1,2
1223003,5,3,3,1,2,1,2,1,1,2
1223306,3,1,1,1,2,4,1,1,1,2
1223543,1,2,1,3,2,1,1,2,1,2
1229929,1,1,1,1,2,1,2,1,1,2
1231853,4,2,2,1,2,1,2,1,1,2
1234554,1,1,1,1,2,1,2,1,1,2
1236837,2,3,2,2,2,2,3,1,1,2
1237674,3,1,2,1,2,1,2,1,1,2
1238021,1,1,1,1,2,1,2,1,1,2
1238464,1,1,1,1,1,?,2,1,1,2
1238633,10,10,10,6,8,4,8,5,1,4
1238915,5,1,2,1,2,1,3,1,1,2
1238948,8,5,6,2,3,10,6,6,1,4
1239232,3,3,2,6,3,3,3,5,1,2
1239347,8,7,8,5,10,10,7,2,1,4
1239967,1,1,1,1,2,1,2,1,1,2
1240337,5,2,2,2,2,2,3,2,2,2
1253505,2,3,1,1,5,1,1,1,1,2
1255384,3,2,2,3,2,3,3,1,1,2
1257200,10,10,10,7,10,10,8,2,1,4
1257648,4,3,3,1,2,1,3,3,1,2
1257815,5,1,3,1,2,1,2,1,1,2
1257938,3,1,1,1,2,1,1,1,1,2
1258549,9,10,10,10,10,10,10,10,1,4
1258556,5,3,6,1,2,1,1,1,1,2
1266154,8,7,8,2,4,2,5,10,1,4
1272039,1,1,1,1,2,1,2,1,1,2
1276091,2,1,1,1,2,1,2,1,1,2
1276091,1,3,1,1,2,1,2,2,1,2
1276091,5,1,1,3,4,1,3,2,1,2
1277629,5,1,1,1,2,1,2,2,1,2
1293439,3,2,2,3,2,1,1,1,1,2
1293439,6,9,7,5,5,8,4,2,1,2
1294562,10,8,10,1,3,10,5,1,1,4
1295186,10,10,10,1,6,1,2,8,1,4
527337,4,1,1,1,2,1,1,1,1,2
558538,4,1,3,3,2,1,1,1,1,2
566509,5,1,1,1,2,1,1,1,1,2
608157,10,4,3,10,4,10,10,1,1,4
677910,5,2,2,4,2,4,1,1,1,2
734111,1,1,1,3,2,3,1,1,1,2
734111,1,1,1,1,2,2,1,1,1,2
780555,5,1,1,6,3,1,2,1,1,2
827627,2,1,1,1,2,1,1,1,1,2
1049837,1,1,1,1,2,1,1,1,1,2
1058849,5,1,1,1,2,1,1,1,1,2
1182404,1,1,1,1,1,1,1,1,1,2
1193544,5,7,9,8,6,10,8,10,1,4
1201870,4,1,1,3,1,1,2,1,1,2
1202253,5,1,1,1,2,1,1,1,1,2
1227081,3,1,1,3,2,1,1,1,1,2
1230994,4,5,5,8,6,10,10,7,1,4
1238410,2,3,1,1,3,1,1,1,1,2
1246562,10,2,2,1,2,6,1,1,2,4
1257470,10,6,5,8,5,10,8,6,1,4
1259008,8,8,9,6,6,3,10,10,1,4
1266124,5,1,2,1,2,1,1,1,1,2
1267898,5,1,3,1,2,1,1,1,1,2
1268313,5,1,1,3,2,1,1,1,1,2
1268804,3,1,1,1,2,5,1,1,1,2
1276091,6,1,1,3,2,1,1,1,1,2
1280258,4,1,1,1,2,1,1,2,1,2
1293966,4,1,1,1,2,1,1,1,1,2
1296572,10,9,8,7,6,4,7,10,3,4
1298416,10,6,6,2,4,10,9,7,1,4
1299596,6,6,6,5,4,10,7,6,2,4
1105524,4,1,1,1,2,1,1,1,1,2
1181685,1,1,2,1,2,1,2,1,1,2
1211594,3,1,1,1,1,1,2,1,1,2
1238777,6,1,1,3,2,1,1,1,1,2
1257608,6,1,1,1,1,1,1,1,1,2
1269574,4,1,1,1,2,1,1,1,1,2
1277145,5,1,1,1,2,1,1,1,1,2
1287282,3,1,1,1,2,1,1,1,1,2
1296025,4,1,2,1,2,1,1,1,1,2
1296263,4,1,1,1,2,1,1,1,1,2
1296593,5,2,1,1,2,1,1,1,1,2
1299161,4,8,7,10,4,10,7,5,1,4
1301945,5,1,1,1,1,1,1,1,1,2
1302428,5,3,2,4,2,1,1,1,1,2
1318169,9,10,10,10,10,5,10,10,10,4
474162,8,7,8,5,5,10,9,10,1,4
787451,5,1,2,1,2,1,1,1,1,2
1002025,1,1,1,3,1,3,1,1,1,2
1070522,3,1,1,1,1,1,2,1,1,2
1073960,10,10,10,10,6,10,8,1,5,4
1076352,3,6,4,10,3,3,3,4,1,4
1084139,6,3,2,1,3,4,4,1,1,4
1115293,1,1,1,1,2,1,1,1,1,2
1119189,5,8,9,4,3,10,7,1,1,4
1133991,4,1,1,1,1,1,2,1,1,2
1142706,5,10,10,10,6,10,6,5,2,4
1155967,5,1,2,10,4,5,2,1,1,2
1170945,3,1,1,1,1,1,2,1,1,2
1181567,1,1,1,1,1,1,1,1,1,2
1182404,4,2,1,1,2,1,1,1,1,2
1204558,4,1,1,1,2,1,2,1,1,2
1217952,4,1,1,1,2,1,2,1,1,2
1224565,6,1,1,1,2,1,3,1,1,2
1238186,4,1,1,1,2,1,2,1,1,2
1253917,4,1,1,2,2,1,2,1,1,2
1265899,4,1,1,1,2,1,3,1,1,2
1268766,1,1,1,1,2,1,1,1,1,2
1277268,3,3,1,1,2,1,1,1,1,2
1286943,8,10,10,10,7,5,4,8,7,4
1295508,1,1,1,1,2,4,1,1,1,2
1297327,5,1,1,1,2,1,1,1,1,2
1297522,2,1,1,1,2,1,1,1,1,2
1298360,1,1,1,1,2,1,1,1,1,2
1299924,5,1,1,1,2,1,2,1,1,2
1299994,5,1,1,1,2,1,1,1,1,2
1304595,3,1,1,1,1,1,2,1,1,2
1306282,6,6,7,10,3,10,8,10,2,4
1313325,4,10,4,7,3,10,9,10,1,4
1320077,1,1,1,1,1,1,1,1,1,2
1320077,1,1,1,1,1,1,2,1,1,2
1320304,3,1,2,2,2,1,1,1,1,2
1330439,4,7,8,3,4,10,9,1,1,4
333093,1,1,1,1,3,1,1,1,1,2
369565,4,1,1,1,3,1,1,1,1,2
412300,10,4,5,4,3,5,7,3,1,4
672113,7,5,6,10,4,10,5,3,1,4
749653,3,1,1,1,2,1,2,1,1,2
769612,3,1,1,2,2,1,1,1,1,2
769612,4,1,1,1,2,1,1,1,1,2
798429,4,1,1,1,2,1,3,1,1,2
807657,6,1,3,2,2,1,1,1,1,2
8233704,4,1,1,1,1,1,2,1,1,2
837480,7,4,4,3,4,10,6,9,1,4
867392,4,2,2,1,2,1,2,1,1,2
869828,1,1,1,1,1,1,3,1,1,2
1043068,3,1,1,1,2,1,2,1,1,2
1056171,2,1,1,1,2,1,2,1,1,2
1061990,1,1,3,2,2,1,3,1,1,2
1113061,5,1,1,1,2,1,3,1,1,2
1116192,5,1,2,1,2,1,3,1,1,2
1135090,4,1,1,1,2,1,2,1,1,2
1145420,6,1,1,1,2,1,2,1,1,2
1158157,5,1,1,1,2,2,2,1,1,2
1171578,3,1,1,1,2,1,1,1,1,2
1174841,5,3,1,1,2,1,1,1,1,2
1184586,4,1,1,1,2,1,2,1,1,2
1186936,2,1,3,2,2,1,2,1,1,2
1197527,5,1,1,1,2,1,2,1,1,2
1222464,6,10,10,10,4,10,7,10,1,4
1240603,2,1,1,1,1,1,1,1,1,2
1240603,3,1,1,1,1,1,1,1,1,2
1241035,7,8,3,7,4,5,7,8,2,4
1287971,3,1,1,1,2,1,2,1,1,2
1289391,1,1,1,1,2,1,3,1,1,2
1299924,3,2,2,2,2,1,4,2,1,2
1306339,4,4,2,1,2,5,2,1,2,2
1313658,3,1,1,1,2,1,1,1,1,2
1313982,4,3,1,1,2,1,4,8,1,2
1321264,5,2,2,2,1,1,2,1,1,2
1321321,5,1,1,3,2,1,1,1,1,2
1321348,2,1,1,1,2,1,2,1,1,2
1321931,5,1,1,1,2,1,2,1,1,2
1321942,5,1,1,1,2,1,3,1,1,2
1321942,5,1,1,1,2,1,3,1,1,2
1328331,1,1,1,1,2,1,3,1,1,2
1328755,3,1,1,1,2,1,2,1,1,2
1331405,4,1,1,1,2,1,3,2,1,2
1331412,5,7,10,10,5,10,10,10,1,4
1333104,3,1,2,1,2,1,3,1,1,2
1334071,4,1,1,1,2,3,2,1,1,2
1343068,8,4,4,1,6,10,2,5,2,4
1343374,10,10,8,10,6,5,10,3,1,4
1344121,8,10,4,4,8,10,8,2,1,4
142932,7,6,10,5,3,10,9,10,2,4
183936,3,1,1,1,2,1,2,1,1,2
324382,1,1,1,1,2,1,2,1,1,2
378275,10,9,7,3,4,2,7,7,1,4
385103,5,1,2,1,2,1,3,1,1,2
690557,5,1,1,1,2,1,2,1,1,2
695091,1,1,1,1,2,1,2,1,1,2
695219,1,1,1,1,2,1,2,1,1,2
824249,1,1,1,1,2,1,3,1,1,2
871549,5,1,2,1,2,1,2,1,1,2
878358,5,7,10,6,5,10,7,5,1,4
1107684,6,10,5,5,4,10,6,10,1,4
1115762,3,1,1,1,2,1,1,1,1,2
1217717,5,1,1,6,3,1,1,1,1,2
1239420,1,1,1,1,2,1,1,1,1,2
1254538,8,10,10,10,6,10,10,10,1,4
1261751,5,1,1,1,2,1,2,2,1,2
1268275,9,8,8,9,6,3,4,1,1,4
1272166,5,1,1,1,2,1,1,1,1,2
1294261,4,10,8,5,4,1,10,1,1,4
1295529,2,5,7,6,4,10,7,6,1,4
1298484,10,3,4,5,3,10,4,1,1,4
1311875,5,1,2,1,2,1,1,1,1,2
1315506,4,8,6,3,4,10,7,1,1,4
1320141,5,1,1,1,2,1,2,1,1,2
1325309,4,1,2,1,2,1,2,1,1,2
1333063,5,1,3,1,2,1,3,1,1,2
1333495,3,1,1,1,2,1,2,1,1,2
1334659,5,2,4,1,1,1,1,1,1,2
1336798,3,1,1,1,2,1,2,1,1,2
1344449,1,1,1,1,1,1,2,1,1,2
1350568,4,1,1,1,2,1,2,1,1,2
1352663,5,4,6,8,4,1,8,10,1,4
188336,5,3,2,8,5,10,8,1,2,4
352431,10,5,10,3,5,8,7,8,3,4
353098,4,1,1,2,2,1,1,1,1,2
411453,1,1,1,1,2,1,1,1,1,2
557583,5,10,10,10,10,10,10,1,1,4
636375,5,1,1,1,2,1,1,1,1,2
736150,10,4,3,10,3,10,7,1,2,4
803531,5,10,10,10,5,2,8,5,1,4
822829,8,10,10,10,6,10,10,10,10,4
1016634,2,3,1,1,2,1,2,1,1,2
1031608,2,1,1,1,1,1,2,1,1,2
1041043,4,1,3,1,2,1,2,1,1,2
1042252,3,1,1,1,2,1,2,1,1,2
1057067,1,1,1,1,1,?,1,1,1,2
1061990,4,1,1,1,2,1,2,1,1,2
1073836,5,1,1,1,2,1,2,1,1,2
1083817,3,1,1,1,2,1,2,1,1,2
1096352,6,3,3,3,3,2,6,1,1,2
1140597,7,1,2,3,2,1,2,1,1,2
1149548,1,1,1,1,2,1,1,1,1,2
1174009,5,1,1,2,1,1,2,1,1,2
1183596,3,1,3,1,3,4,1,1,1,2
1190386,4,6,6,5,7,6,7,7,3,4
1190546,2,1,1,1,2,5,1,1,1,2
1213273,2,1,1,1,2,1,1,1,1,2
1218982,4,1,1,1,2,1,1,1,1,2
1225382,6,2,3,1,2,1,1,1,1,2
1235807,5,1,1,1,2,1,2,1,1,2
1238777,1,1,1,1,2,1,1,1,1,2
1253955,8,7,4,4,5,3,5,10,1,4
1257366,3,1,1,1,2,1,1,1,1,2
1260659,3,1,4,1,2,1,1,1,1,2
1268952,10,10,7,8,7,1,10,10,3,4
1275807,4,2,4,3,2,2,2,1,1,2
1277792,4,1,1,1,2,1,1,1,1,2
1277792,5,1,1,3,2,1,1,1,1,2
1285722,4,1,1,3,2,1,1,1,1,2
1288608,3,1,1,1,2,1,2,1,1,2
1290203,3,1,1,1,2,1,2,1,1,2
1294413,1,1,1,1,2,1,1,1,1,2
1299596,2,1,1,1,2,1,1,1,1,2
1303489,3,1,1,1,2,1,2,1,1,2
1311033,1,2,2,1,2,1,1,1,1,2
1311108,1,1,1,3,2,1,1,1,1,2
1315807,5,10,10,10,10,2,10,10,10,4
1318671,3,1,1,1,2,1,2,1,1,2
1319609,3,1,1,2,3,4,1,1,1,2
1323477,1,2,1,3,2,1,2,1,1,2
1324572,5,1,1,1,2,1,2,2,1,2
1324681,4,1,1,1,2,1,2,1,1,2
1325159,3,1,1,1,2,1,3,1,1,2
1326892,3,1,1,1,2,1,2,1,1,2
1330361,5,1,1,1,2,1,2,1,1,2
1333877,5,4,5,1,8,1,3,6,1,2
1334015,7,8,8,7,3,10,7,2,3,4
1334667,1,1,1,1,2,1,1,1,1,2
1339781,1,1,1,1,2,1,2,1,1,2
1339781,4,1,1,1,2,1,3,1,1,2
13454352,1,1,3,1,2,1,2,1,1,2
1345452,1,1,3,1,2,1,2,1,1,2
1345593,3,1,1,3,2,1,2,1,1,2
1347749,1,1,1,1,2,1,1,1,1,2
1347943,5,2,2,2,2,1,1,1,2,2
1348851,3,1,1,1,2,1,3,1,1,2
1350319,5,7,4,1,6,1,7,10,3,4
1350423,5,10,10,8,5,5,7,10,1,4
1352848,3,10,7,8,5,8,7,4,1,4
1353092,3,2,1,2,2,1,3,1,1,2
1354840,2,1,1,1,2,1,3,1,1,2
1354840,5,3,2,1,3,1,1,1,1,2
1355260,1,1,1,1,2,1,2,1,1,2
1365075,4,1,4,1,2,1,1,1,1,2
1365328,1,1,2,1,2,1,2,1,1,2
1368267,5,1,1,1,2,1,1,1,1,2
1368273,1,1,1,1,2,1,1,1,1,2
1368882,2,1,1,1,2,1,1,1,1,2
1369821,10,10,10,10,5,10,10,10,7,4
1371026,5,10,10,10,4,10,5,6,3,4
1371920,5,1,1,1,2,1,3,2,1,2
466906,1,1,1,1,2,1,1,1,1,2
466906,1,1,1,1,2,1,1,1,1,2
534555,1,1,1,1,2,1,1,1,1,2
536708,1,1,1,1,2,1,1,1,1,2
566346,3,1,1,1,2,1,2,3,1,2
603148,4,1,1,1,2,1,1,1,1,2
654546,1,1,1,1,2,1,1,1,8,2
654546,1,1,1,3,2,1,1,1,1,2
695091,5,10,10,5,4,5,4,4,1,4
714039,3,1,1,1,2,1,1,1,1,2
763235,3,1,1,1,2,1,2,1,2,2
776715,3,1,1,1,3,2,1,1,1,2
841769,2,1,1,1,2,1,1,1,1,2
888820,5,10,10,3,7,3,8,10,2,4
897471,4,8,6,4,3,4,10,6,1,4
897471,4,8,8,5,4,5,10,4,1,4

126
one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.names

@@ -0,0 +1,126 @@
Citation Request:
This breast cancer database was obtained from the University of Wisconsin
Hospitals, Madison from Dr. William H. Wolberg. If you publish results
when using this database, then please include this information in your
acknowledgements. Also, please cite one or more of:
1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear
programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of
pattern separation for medical diagnosis applied to breast cytology",
Proceedings of the National Academy of Sciences, U.S.A., Volume 87,
December 1990, pp 9193-9196.
3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition
via linear programming: Theory and application to medical diagnosis",
in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming
discrimination of two linearly inseparable sets", Optimization Methods
and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).
1. Title: Wisconsin Breast Cancer Database (January 8, 1991)
2. Sources:
-- Dr. William H. Wolberg (physician)
University of Wisconsin Hospitals
Madison, Wisconsin
USA
-- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
Received by David W. Aha (aha@cs.jhu.edu)
-- Date: 15 July 1992
3. Past Usage:
Attributes 2 through 10 have been used to represent instances.
Each instance has one of 2 possible classes: benign or malignant.
1. Wolberg,~W.~H., \& Mangasarian,~O.~L. (1990). Multisurface method of
pattern separation for medical diagnosis applied to breast cytology. In
{\it Proceedings of the National Academy of Sciences}, {\it 87},
9193--9196.
-- Size of data set: only 369 instances (at that point in time)
-- Collected classification results: 1 trial only
-- Two pairs of parallel hyperplanes were found to be consistent with
50% of the data
-- Accuracy on remaining 50% of dataset: 93.5%
-- Three pairs of parallel hyperplanes were found to be consistent with
67% of data
-- Accuracy on remaining 33% of dataset: 95.9%
2. Zhang,~J. (1992). Selecting typical instances in instance-based
learning. In {\it Proceedings of the Ninth International Machine
Learning Conference} (pp. 470--479). Aberdeen, Scotland: Morgan
Kaufmann.
-- Size of data set: only 369 instances (at that point in time)
-- Applied 4 instance-based learning algorithms
-- Collected classification results averaged over 10 trials
-- Best accuracy result:
-- 1-nearest neighbor: 93.7%
-- trained on 200 instances, tested on the other 169
-- Also of interest:
-- Using only typical instances: 92.2% (storing only 23.1 instances)
-- trained on 200 instances, tested on the other 169
4. Relevant Information:
Samples arrive periodically as Dr. Wolberg reports his clinical cases.
The database therefore reflects this chronological grouping of the data.
This grouping information appears immediately below, having been removed
from the data itself:
Group 1: 367 instances (January 1989)
Group 2: 70 instances (October 1989)
Group 3: 31 instances (February 1990)
Group 4: 17 instances (April 1990)
Group 5: 48 instances (August 1990)
Group 6: 49 instances (Updated January 1991)
Group 7: 31 instances (June 1991)
Group 8: 86 instances (November 1991)
-----------------------------------------
Total: 699 points (as of the donated database on 15 July 1992)
Note that the results summarized above in Past Usage refer to a dataset
of size 369, while Group 1 has only 367 instances. This is because it
originally contained 369 instances; 2 were removed. The following
statements summarize changes to the original Group 1's set of data:
##### Group 1 : 367 points: 200B 167M (January 1989)
##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
##### Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
##### : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
##### : Changed 0 to 1 in field 6 of sample 1219406
##### : Changed 0 to 1 in field 8 of following sample:
##### : 1182404,2,3,1,1,1,2,0,1,1,1
5. Number of Instances: 699 (as of 15 July 1992)
6. Number of Attributes: 10 plus the class attribute
7. Attribute Information: (class attribute has been moved to last column)
# Attribute Domain
-- -----------------------------------------
1. Sample code number id number
2. Clump Thickness 1 - 10
3. Uniformity of Cell Size 1 - 10
4. Uniformity of Cell Shape 1 - 10
5. Marginal Adhesion 1 - 10
6. Single Epithelial Cell Size 1 - 10
7. Bare Nuclei 1 - 10
8. Bland Chromatin 1 - 10
9. Normal Nucleoli 1 - 10
10. Mitoses 1 - 10
11. Class: (2 for benign, 4 for malignant)
8. Missing attribute values: 16
There are 16 instances in Groups 1 to 6 that contain a single missing
(i.e., unavailable) attribute value, now denoted by "?".
9. Class distribution:
Benign: 458 (65.5%)
Malignant: 241 (34.5%)

286
one_exercise_per_file/week02/day03/ex05/data/breast-cancer.csv

@@ -0,0 +1,286 @@
"40-49","premeno","15-19","0-2","yes","3","right","left_up","no","recurrence-events"
"50-59","ge40","15-19","0-2","no","1","right","central","no","no-recurrence-events"
"50-59","ge40","35-39","0-2","no","2","left","left_low","no","recurrence-events"
"40-49","premeno","35-39","0-2","yes","3","right","left_low","yes","no-recurrence-events"
"40-49","premeno","30-34","3-5","yes","2","left","right_up","no","recurrence-events"
"50-59","premeno","25-29","3-5","no","2","right","left_up","yes","no-recurrence-events"
"50-59","ge40","40-44","0-2","no","3","left","left_up","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","0-4","0-2","no","2","right","right_low","no","no-recurrence-events"
"40-49","ge40","40-44","15-17","yes","2","right","left_up","yes","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","right","left_up","no","no-recurrence-events"
"50-59","ge40","30-34","0-2","no","1","right","central","no","no-recurrence-events"
"50-59","ge40","25-29","0-2","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","left","left_low","yes","recurrence-events"
"30-39","premeno","20-24","0-2","no","3","left","central","no","no-recurrence-events"
"50-59","premeno","10-14","3-5","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","right","left_up","no","no-recurrence-events"
"50-59","premeno","40-44","0-2","no","2","left","left_up","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events"
"50-59","lt40","20-24","0-2",nan,"1","left","left_low","no","recurrence-events"
"60-69","ge40","40-44","3-5","no","2","right","left_up","yes","no-recurrence-events"
"50-59","ge40","15-19","0-2","no","2","right","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","1","right","left_up","no","no-recurrence-events"
"30-39","premeno","15-19","6-8","yes","3","left","left_low","yes","recurrence-events"
"50-59","ge40","20-24","3-5","yes","2","right","left_up","no","no-recurrence-events"
"50-59","ge40","10-14","0-2","no","2","right","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","30-34","3-5","yes","3","left","left_low","no","no-recurrence-events"
"40-49","premeno","15-19","15-17","yes","3","left","left_low","no","recurrence-events"
"60-69","ge40","30-34","0-2","no","3","right","central","no","recurrence-events"
"60-69","ge40","25-29","3-5",nan,"1","right","left_low","yes","no-recurrence-events"
"50-59","ge40","25-29","0-2","no","3","left","right_up","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","3","right","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","1","left","left_low","yes","recurrence-events"
"30-39","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","2","right","left_up","no","no-recurrence-events"
"60-69","ge40","45-49","6-8","yes","3","left","central","no","no-recurrence-events"
"40-49","ge40","20-24","0-2","no","3","left","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","1","right","right_low","no","no-recurrence-events"
"30-39","premeno","35-39","0-2","no","3","left","left_low","no","recurrence-events"
"40-49","premeno","35-39","9-11","yes","2","right","right_up","yes","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","20-24","3-5","yes","3","right","right_up","no","recurrence-events"
"30-39","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","premeno","30-34","0-2","no","3","left","right_up","no","recurrence-events"
"60-69","ge40","10-14","0-2","no","2","right","left_up","yes","no-recurrence-events"
"40-49","premeno","35-39","0-2","yes","3","right","left_up","yes","no-recurrence-events"
"50-59","premeno","50-54","0-2","yes","2","right","left_up","yes","no-recurrence-events"
"50-59","ge40","40-44","0-2","no","3","right","left_up","no","no-recurrence-events"
"70-79","ge40","15-19","9-11",nan,"1","left","left_low","yes","recurrence-events"
"50-59","lt40","30-34","0-2","no","3","right","left_up","no","no-recurrence-events"
"40-49","premeno","0-4","0-2","no","3","left","central","no","no-recurrence-events"
"70-79","ge40","40-44","0-2","no","1","right","right_up","no","no-recurrence-events"
"40-49","premeno","25-29","0-2",nan,"2","left","right_low","yes","no-recurrence-events"
"50-59","ge40","25-29","15-17","yes","3","right","left_up","no","no-recurrence-events"
"50-59","premeno","20-24","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","35-39","15-17","no","3","left","left_low","no","no-recurrence-events"
"50-59","ge40","50-54","0-2","no","1","right","right_up","no","no-recurrence-events"
"30-39","premeno","0-4","0-2","no","2","right","central","no","recurrence-events"
"50-59","ge40","40-44","6-8","yes","3","left","left_low","yes","recurrence-events"
"40-49","premeno","30-34","0-2","no","2","right","right_up","yes","no-recurrence-events"
"40-49","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","15-17","yes","3","left","left_low","no","recurrence-events"
"40-49","ge40","20-24","0-2","no","2","right","left_up","no","recurrence-events"
"50-59","ge40","15-19","0-2","no","1","right","central","no","no-recurrence-events"
"30-39","premeno","25-29","0-2","no","2","right","left_low","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","50-54","9-11","yes","2","right","left_up","no","recurrence-events"
"30-39","premeno","10-14","0-2","no","1","right","left_low","no","no-recurrence-events"
"50-59","premeno","25-29","3-5","yes","3","left","left_low","yes","recurrence-events"
"60-69","ge40","25-29","3-5",nan,"1","right","left_up","yes","no-recurrence-events"
"60-69","ge40","10-14","0-2","no","1","right","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","6-8","yes","3","left","right_low","no","recurrence-events"
"30-39","premeno","25-29","6-8","yes","3","left","right_low","yes","recurrence-events"
"50-59","ge40","10-14","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","central","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","3","left","right_up","no","recurrence-events"
"60-69","ge40","30-34","6-8","yes","2","right","right_up","no","no-recurrence-events"
"50-59","lt40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","left_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","left","left_up","yes","no-recurrence-events"
"30-39","premeno","0-4","0-2","no","2","right","central","no","no-recurrence-events"
"50-59","ge40","35-39","0-2","no","3","left","left_up","no","no-recurrence-events"
"40-49","premeno","40-44","0-2","no","1","right","left_up","no","no-recurrence-events"
"30-39","premeno","25-29","6-8","yes","2","right","left_up","yes","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","1","right","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","0-2","no","1","left","left_up","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","1","right","left_up","no","recurrence-events"
"30-39","premeno","30-34","3-5","no","3","right","left_up","yes","recurrence-events"
"50-59","lt40","20-24","0-2",nan,"1","left","left_up","no","recurrence-events"
"50-59","premeno","10-14","0-2","no","2","right","left_up","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","45-49","0-2","no","2","left","left_low","yes","no-recurrence-events"
"30-39","premeno","40-44","0-2","no","1","left","left_up","no","recurrence-events"
"50-59","premeno","10-14","0-2","no","1","left","left_low","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","3","right","left_up","yes","recurrence-events"
"40-49","premeno","35-39","0-2","no","1","right","left_up","no","recurrence-events"
"40-49","premeno","20-24","3-5","yes","2","left","left_low","yes","recurrence-events"
"50-59","premeno","15-19","0-2","no","2","left","left_low","no","recurrence-events"
"50-59","ge40","30-34","0-2","no","3","right","left_low","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","1","left","right_low","no","no-recurrence-events"
"60-69","ge40","30-34","3-5","yes","2","left","central","yes","recurrence-events"
"60-69","ge40","20-24","3-5","no","2","left","left_low","yes","recurrence-events"
"50-59","premeno","25-29","0-2","no","2","left","right_up","no","recurrence-events"
"50-59","ge40","30-34","0-2","no","1","right","right_up","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","right_low","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","2","left","left_low","yes","no-recurrence-events"
"30-39","premeno","30-34","0-2","no","2","left","left_up","no","no-recurrence-events"
"30-39","premeno","40-44","3-5","no","3","right","right_up","yes","no-recurrence-events"
"60-69","ge40","5-9","0-2","no","1","left","central","no","no-recurrence-events"
"60-69","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","6-8","yes","3","right","left_up","no","recurrence-events"
"60-69","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events"
"40-49","premeno","35-39","9-11","yes","2","right","left_up","yes","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","1","right","left_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","yes","3","right","right_up","no","recurrence-events"
"50-59","premeno","25-29","0-2","yes","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","15-19","0-2","no","2","left","left_low","no","no-recurrence-events"
"30-39","premeno","35-39","9-11","yes","3","left","left_low","no","recurrence-events"
"30-39","premeno","10-14","0-2","no","2","left","right_low","no","no-recurrence-events"
"50-59","ge40","30-34","0-2","no","1","right","left_low","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","2","left","left_up","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","15-19","0-2","no","2","left","left_up","no","recurrence-events"
"60-69","ge40","15-19","0-2","no","2","right","left_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","2","left","right_low","no","no-recurrence-events"
"20-29","premeno","35-39","0-2","no","2","right","right_up","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","3","right","right_up","no","recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","left_low","no","recurrence-events"
"30-39","premeno","30-34","0-2","no","3","left","left_low","no","no-recurrence-events"
"30-39","premeno","15-19","0-2","no","1","right","left_low","no","recurrence-events"
"50-59","ge40","0-4","0-2","no","1","right","central","no","no-recurrence-events"
"50-59","ge40","0-4","0-2","no","1","left","left_low","no","no-recurrence-events"
"60-69","ge40","50-54","0-2","no","3","right","left_up","no","recurrence-events"
"50-59","premeno","30-34","0-2","no","1","left","central","no","no-recurrence-events"
"60-69","ge40","20-24","15-17","yes","3","left","left_low","yes","recurrence-events"
"40-49","premeno","25-29","0-2","no","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","3-5","no","2","right","left_up","no","recurrence-events"
"50-59","premeno","20-24","3-5","yes","2","left","left_low","no","no-recurrence-events"
"50-59","ge40","15-19","0-2","yes","2","left","central","yes","no-recurrence-events"
"50-59","premeno","10-14","0-2","no","3","left","left_low","no","no-recurrence-events"
"30-39","premeno","30-34","9-11","no","2","right","left_up","yes","recurrence-events"
"60-69","ge40","10-14","0-2","no","1","left","left_low","no","no-recurrence-events"
"40-49","premeno","40-44","0-2","no","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","9-11",nan,"3","left","left_up","yes","no-recurrence-events"
"40-49","premeno","50-54","0-2","no","2","right","left_low","yes","recurrence-events"
"50-59","ge40","15-19","0-2","no","2","right","right_up","no","no-recurrence-events"
"50-59","ge40","40-44","3-5","yes","2","left","left_low","no","no-recurrence-events"
"30-39","premeno","25-29","3-5","yes","3","left","left_low","yes","recurrence-events"
"60-69","ge40","10-14","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","lt40","10-14","0-2","no","1","left","right_up","no","no-recurrence-events"
"30-39","premeno","30-34","0-2","no","2","left","left_up","no","recurrence-events"
"30-39","premeno","20-24","3-5","yes","2","left","left_low","no","recurrence-events"
"50-59","ge40","10-14","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","3","right","left_up","no","no-recurrence-events"
"50-59","ge40","25-29","3-5","yes","3","right","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","6-8","no","2","left","left_up","no","no-recurrence-events"
"60-69","ge40","50-54","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","30-34","0-2","no","3","left","left_low","no","no-recurrence-events"
"40-49","ge40","20-24","3-5","no","3","right","left_low","yes","recurrence-events"
"50-59","ge40","30-34","6-8","yes","2","left","right_low","yes","recurrence-events"
"60-69","ge40","25-29","3-5","no","2","right","right_up","no","recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","central","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","50-54","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","2","right","central","no","recurrence-events"
"50-59","ge40","30-34","3-5","no","3","right","left_up","no","recurrence-events"
"40-49","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","1","right","left_up","no","recurrence-events"
"40-49","premeno","40-44","3-5","yes","3","right","left_up","yes","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","20-24","3-5","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","25-29","9-11","yes","3","right","left_up","no","recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","left_low","no","recurrence-events"
"40-49","premeno","20-24","0-2","no","1","right","right_up","no","no-recurrence-events"
"30-39","premeno","40-44","0-2","no","2","right","right_up","no","no-recurrence-events"
"60-69","ge40","10-14","6-8","yes","3","left","left_up","yes","recurrence-events"
"40-49","premeno","35-39","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","3-5","no","3","left","left_low","no","recurrence-events"
"40-49","premeno","5-9","0-2","no","1","left","left_low","yes","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","1","left","right_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","3","right","right_up","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","3","left","left_up","no","recurrence-events"
"50-59","ge40","5-9","0-2","no","2","right","right_up","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","2","right","right_low","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","2","left","right_up","no","recurrence-events"
"40-49","premeno","10-14","0-2","no","2","left","left_low","yes","no-recurrence-events"
"60-69","ge40","35-39","6-8","yes","3","left","left_low","no","recurrence-events"
"60-69","ge40","50-54","0-2","no","2","right","left_up","yes","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","left_up","no","no-recurrence-events"
"30-39","premeno","20-24","3-5","no","2","right","central","no","no-recurrence-events"
"30-39","premeno","30-34","0-2","no","1","right","left_up","no","recurrence-events"
"60-69","lt40","30-34","0-2","no","1","left","left_low","no","no-recurrence-events"
"40-49","premeno","15-19","12-14","no","3","right","right_low","yes","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","3","right","left_low","no","recurrence-events"
"30-39","premeno","5-9","0-2","no","2","left","right_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","3","left","left_up","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","3","left","left_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","1","right","right_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","1","left","right_low","no","no-recurrence-events"
"60-69","ge40","40-44","3-5","yes","3","right","left_low","no","recurrence-events"
"50-59","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","30-34","0-2","no","3","right","left_up","yes","recurrence-events"
"40-49","ge40","30-34","3-5","no","3","left","left_low","no","recurrence-events"
"40-49","premeno","25-29","0-2","no","1","right","left_low","yes","no-recurrence-events"
"40-49","ge40","25-29","12-14","yes","3","left","right_low","yes","recurrence-events"
"40-49","premeno","40-44","0-2","no","1","left","left_low","no","recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","ge40","25-29","0-2","no","1","left","right_low","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","right","left_up","no","no-recurrence-events"
"70-79","ge40","40-44","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","3","left","left_up","no","recurrence-events"
"50-59","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","ge40","45-49","0-2","no","1","right","right_up","yes","recurrence-events"
"50-59","ge40","20-24","0-2","yes","2","right","left_up","no","no-recurrence-events"
"50-59","ge40","25-29","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events"
"40-49","premeno","20-24","3-5","no","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","35-39","0-2","no","2","left","left_up","no","no-recurrence-events"
"30-39","premeno","20-24","0-2","no","3","left","left_up","yes","recurrence-events"
"60-69","ge40","30-34","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","3","right","left_low","no","no-recurrence-events"
"40-49","ge40","30-34","0-2","no","2","left","left_up","yes","no-recurrence-events"
"30-39","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","left_low","no","recurrence-events"
"30-39","premeno","20-24","0-2","no","2","left","right_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","2","right","left_low","no","no-recurrence-events"
"50-59","premeno","15-19","0-2","no","2","right","right_low","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","2","right","left_up","no","no-recurrence-events"
"60-69","ge40","40-44","0-2","no","2","right","left_low","no","recurrence-events"
"30-39","lt40","15-19","0-2","no","3","right","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","12-14","yes","3","left","left_up","yes","recurrence-events"
"60-69","ge40","30-34","0-2","yes","2","right","right_up","yes","recurrence-events"
"50-59","ge40","40-44","6-8","yes","3","left","left_low","yes","recurrence-events"
"50-59","ge40","30-34","0-2","no","3","left",nan,"no","recurrence-events"
"70-79","ge40","10-14","0-2","no","2","left","central","no","no-recurrence-events"
"30-39","premeno","40-44","0-2","no","2","left","left_low","yes","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","2","right","right_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","1","left","left_low","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","9-11","yes","3","left","right_low","yes","recurrence-events"
"50-59","ge40","10-14","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","1","left","right_up","no","no-recurrence-events"
"70-79","ge40","0-4","0-2","no","1","left","right_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","3","right","left_up","yes","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","3","right","left_low","yes","recurrence-events"
"50-59","ge40","40-44","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","3","left","right_low","yes","recurrence-events"
"40-49","premeno","30-34","3-5","yes","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","2","left","left_up","no","recurrence-events"
"70-79","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events"
"30-39","premeno","25-29","0-2","no","1","left","central","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","20-24","3-5","yes","2","right","right_up","yes","recurrence-events"
"50-59","ge40","30-34","9-11",nan,"3","left","left_low","yes","no-recurrence-events"
"50-59","ge40","0-4","0-2","no","2","left","central","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","3","right","left_low","yes","no-recurrence-events"
"30-39","premeno","35-39","0-2","no","3","left","left_low","no","recurrence-events"
"60-69","ge40","30-34","0-2","no","1","left","left_up","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","25-29","6-8","no","3","left","left_low","yes","recurrence-events"
"50-59","premeno","35-39","15-17","yes","3","right","right_up","no","recurrence-events"
"30-39","premeno","20-24","3-5","yes","2","right","left_up","yes","no-recurrence-events"
"40-49","premeno","20-24","6-8","no","2","right","left_low","yes","no-recurrence-events"
"50-59","ge40","35-39","0-2","no","3","left","left_low","no","no-recurrence-events"
"50-59","premeno","35-39","0-2","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","left","left_up","yes","no-recurrence-events"
"40-49","premeno","35-39","0-2","no","2","right","right_up","no","no-recurrence-events"
"50-59","premeno","30-34","3-5","yes","2","left","left_low","yes","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","right","right_up","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","3","right","left_up","yes","no-recurrence-events"
"50-59","ge40","30-34","6-8","yes","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","25-29","3-5","yes","2","left","left_low","yes","no-recurrence-events"
"30-39","premeno","30-34","6-8","yes","2","right","right_up","no","no-recurrence-events"
"50-59","premeno","15-19","0-2","no","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","40-44","0-2","no","3","left","right_up","no","no-recurrence-events"

73
one_exercise_per_file/week02/day03/ex05/data/breast_cancer_readme.txt

@ -0,0 +1,73 @@
Citation Request:
This breast cancer domain was obtained from the University Medical Centre,
Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and
M. Soklic for providing the data. Please include this citation if you plan
to use this database.
1. Title: Breast cancer data (Michalski has used this)
2. Sources:
-- Matjaz Zwitter & Milan Soklic (physicians)
Institute of Oncology
University Medical Center
Ljubljana, Yugoslavia
-- Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
-- Date: 11 July 1988
3. Past Usage: (Several: here are some)
-- Michalski,R.S., Mozetic,I., Hong,J., & Lavrac,N. (1986). The
Multi-Purpose Incremental Learning System AQ15 and its Testing
Application to Three Medical Domains. In Proceedings of the
Fifth National Conference on Artificial Intelligence, 1041-1045,
Philadelphia, PA: Morgan Kaufmann.
-- accuracy range: 66%-72%
-- Clark,P. & Niblett,T. (1987). Induction in Noisy Domains. In
Progress in Machine Learning (from the Proceedings of the 2nd
European Working Session on Learning), 11-30, Bled,
Yugoslavia: Sigma Press.
-- 8 test results given: 65%-72% accuracy range
-- Tan, M., & Eshelman, L. (1988). Using weighted networks to
represent classification knowledge in noisy domains. Proceedings
of the Fifth International Conference on Machine Learning, 121-134,
Ann Arbor, MI.
-- 4 systems tested: accuracy range was 68%-73.5%
-- Cestnik,G., Konenenko,I, & Bratko,I. (1987). Assistant-86: A
Knowledge-Elicitation Tool for Sophisticated Users. In I.Bratko
& N.Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press.
-- Assistant-86: 78% accuracy
4. Relevant Information:
This is one of three domains provided by the Oncology Institute
that has repeatedly appeared in the machine learning literature.
(See also lymphography and primary-tumor.)
This data set includes 201 instances of one class and 85 instances of
another class. The instances are described by 9 attributes, some of
which are linear and some are nominal.
5. Number of Instances: 286
6. Number of Attributes: 9 + the class attribute
7. Attribute Information:
1. Class: no-recurrence-events, recurrence-events
2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
3. menopause: lt40, ge40, premeno.
4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44,
45-49, 50-54, 55-59.
5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26,
27-29, 30-32, 33-35, 36-39.
6. node-caps: yes, no.
7. deg-malig: 1, 2, 3.
8. breast: left, right.
9. breast-quad: left-up, left-low, right-up, right-low, central.
10. irradiat: yes, no.
8. Missing Attribute Values: (denoted by "?")
Attribute #: Number of instances with missing values:
6. 8
9. 1.
9. Class Distribution:
1. no-recurrence-events: 201 instances
2. recurrence-events: 85 instances

0
one_exercise_per_file/week02/day04/audit/readme.md

1
one_exercise_per_file/week02/day04/ex01/audit/readme.md

@ -0,0 +1 @@
1. This question is validated if the MSE returned is **2.25**.
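For reference, a minimal sketch that reproduces this value, reusing `y_true` and `y_pred` from the exercise statement:
```python
from sklearn.metrics import mean_squared_error

y_true = [91, 51, 2.5, 2, -5]
y_pred = [90, 48, 2, 2, -4]

# squared errors: 1 + 9 + 0.25 + 0 + 1 = 11.25, divided by 5 -> 2.25
print(mean_squared_error(y_true, y_pred))  # 2.25
```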

10
one_exercise_per_file/week02/day04/ex01/readme.md

@ -0,0 +1,10 @@
# Exercise 1 MSE Scikit-learn
The goal of this exercise is to learn to use `sklearn.metrics` to compute the mean squared error (MSE).
1. Compute the MSE using `sklearn.metrics` on `y_true` and `y_pred` below:
```python
y_true = [91, 51, 2.5, 2, -5]
y_pred = [90, 48, 2, 2, -4]
```

1
one_exercise_per_file/week02/day04/ex02/audit/readme.md

@ -0,0 +1 @@
1. This question is validated if the accuracy returned is **0.5714285714285714**.
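For reference, a minimal sketch that reproduces this value, reusing `y_true` and `y_pred` from the exercise statement:
```python
from sklearn.metrics import accuracy_score

y_pred = [0, 1, 0, 1, 0, 1, 0]
y_true = [0, 0, 1, 1, 1, 1, 0]

# 4 predictions out of 7 match the true labels -> 4/7 = 0.5714285714285714
print(accuracy_score(y_true, y_pred))
```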

10
one_exercise_per_file/week02/day04/ex02/readme.md

@ -0,0 +1,10 @@
# Exercise 2 Accuracy Scikit-learn
The goal of this exercise is to learn to use `sklearn.metrics` to compute the accuracy.
1. Compute the accuracy using `sklearn.metrics` on `y_true` and `y_pred` below:
```python
y_pred = [0, 1, 0, 1, 0, 1, 0]
y_true = [0, 0, 1, 1, 1, 1, 0]
```

28
one_exercise_per_file/week02/day04/ex03/audit/readme.md

@ -0,0 +1,28 @@
1. This question is validated if the predictions on the train set and test set are:
```console
# 10 first values Train
array([1.54505951, 2.21338527, 2.2636205 , 3.3258957 , 1.51710076,
1.63209319, 2.9265211 , 0.78080924, 1.21968217, 0.72656239])
```
```console
#10 first values Test
array([ 1.82212706, 1.98357668, 0.80547979, -0.19259114, 1.76072418,
3.27855815, 2.12056804, 1.96099917, 2.38239663, 1.21005304])
```
2. This question is validated if the results match this output:
```console
r2 on the train set: 0.3552292936915783
MAE on the train set: 0.5300159371615256
MSE on the train set: 0.5210784446797679
r2 on the test set: 0.30265471284464673
MAE on the test set: 0.5454023699809112
MSE on the test set: 0.5537420654727396
```
This result shows that the model performs slightly better on the train set than on the test set. That's frequent since it is easier to get a good grade on an exam you studied for than on an exam that differs from what you prepared. However, the results are not good: R2 ~ 0.3. Fitting non-linear models such as the Random Forest on this data may improve the results. That's the goal of exercise 5.
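A minimal sketch that produces these predictions and metrics, assuming the fitted pipeline `pipe` and the train/test split from the exercise statement:
```python
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# 1. predict on both sets
y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)
print(y_train_pred[:10])
print(y_test_pred[:10])

# 2. compute the regression metrics on both sets
for set_name, y_true, y_pred in [('train', y_train, y_train_pred),
                                 ('test', y_test, y_test_pred)]:
    print(f"r2 on the {set_name} set:", r2_score(y_true, y_pred))
    print(f"MAE on the {set_name} set:", mean_absolute_error(y_true, y_pred))
    print(f"MSE on the {set_name} set:", mean_squared_error(y_true, y_pred))
```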

37
one_exercise_per_file/week02/day04/ex03/readme.md

@ -0,0 +1,37 @@
# Exercise 3 Regression
The goal of this exercise is to learn to evaluate a machine learning model using many regression metrics.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the metrics, which is why the code to fit the Linear Regression is given.*
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']

# split data train test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=13)

# pipeline
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)

# fit
pipe.fit(X_train, y_train)
```
1. Predict on the train set and test set
2. Compute R2, Mean Square Error, Mean Absolute Error on both train and test set

41
one_exercise_per_file/week02/day04/ex04/audit/readme.md

@ -0,0 +1,41 @@
1. This question is validated if the predictions on the train set and test set are:
```console
# 10 first values Train
array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0])
# 10 first values Test
array([1, 1, 0, 0, 0, 1, 1, 1, 0, 0])
```
2. This question is validated if the results match this output:
```console
F1 on the train set: 0.9911504424778761
Accuracy on the train set: 0.989010989010989
Recall on the train set: 0.9929078014184397
Precision on the train set: 0.9893992932862191
ROC_AUC on the train set: 0.9990161111794368
F1 on the test set: 0.9801324503311258
Accuracy on the test set: 0.9736842105263158
Recall on the test set: 0.9866666666666667
Precision on the test set: 0.9736842105263158
ROC_AUC on the test set: 0.9863247863247864
```
The confusion matrix on the test set should be:
```console
array([[37, 2],
[ 1, 74]])
```
3. The ROC AUC plot should look like:
![alt text][logo_ex4]
[logo_ex4]: ../images/w2_day4_ex4_q3.png "ROC AUC "
Having a 99% ROC AUC is not usual: the data set we used is easy to classify. On real data sets, such a high ROC AUC score should always make you check for leakage.
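A minimal sketch that computes these metrics, assuming the fitted `classifier` and the scaled sets from the exercise statement (`X_test_scaled = scaler.transform(X_test)`):
```python
from sklearn.metrics import (f1_score, accuracy_score, recall_score,
                             precision_score, roc_auc_score, confusion_matrix)

y_train_pred = classifier.predict(X_train_scaled)
y_test_pred = classifier.predict(X_test_scaled)

# ROC AUC is computed on the predicted probabilities, not on the classes
y_train_proba = classifier.predict_proba(X_train_scaled)[:, 1]
y_test_proba = classifier.predict_proba(X_test_scaled)[:, 1]

for set_name, y_true, y_pred, y_proba in [
        ('train', y_train, y_train_pred, y_train_proba),
        ('test', y_test, y_test_pred, y_test_proba)]:
    print(f"F1 on the {set_name} set:", f1_score(y_true, y_pred))
    print(f"Accuracy on the {set_name} set:", accuracy_score(y_true, y_pred))
    print(f"Recall on the {set_name} set:", recall_score(y_true, y_pred))
    print(f"Precision on the {set_name} set:", precision_score(y_true, y_pred))
    print(f"ROC_AUC on the {set_name} set:", roc_auc_score(y_true, y_proba))

print(confusion_matrix(y_test, y_test_pred))
```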

BIN
one_exercise_per_file/week02/day04/ex04/images/w2_day4_ex4_q3.png

(binary image, 36 KiB)

36
one_exercise_per_file/week02/day04/ex04/readme.md

@ -0,0 +1,36 @@
# Exercise 4 Classification
The goal of this exercise is to learn to evaluate a machine learning model using many classification metrics.
Preliminary:
- Import the Breast Cancer data set and split it into a train set and a test set (20%). Fit a logistic regression on the data set. *The goal is to focus on the metrics, which is why the code to fit the Logistic Regression is given.*
```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# the test set must be scaled with the scaler fitted on the train set
X_test_scaled = scaler.transform(X_test)
classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)
```
1. Predict on the train set and test set
2. Compute the F1, accuracy, precision, recall and roc_auc scores on the train set and the test set. Print the confusion matrix for the test set results.
**Note: AUC can only be computed on probabilities, not on classes.**
3. Plot the ROC curve on the test set using `roc_curve` from scikit-learn. There are many ways to create this plot. It should look like this:
![alt text][logo_ex4]
[logo_ex4]: images/w2_day4_ex4_q3.png "ROC AUC "
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html
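One possible way to build this plot, as a minimal sketch (assuming the fitted `classifier`, the scaled test set `X_test_scaled` and matplotlib):
```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# roc_curve needs the probability of the positive class
y_test_proba = classifier.predict_proba(X_test_scaled)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_test_proba)

plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve on the test set')
plt.legend()
plt.show()
```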

72
one_exercise_per_file/week02/day04/ex05/audit/readme.md

@ -0,0 +1,72 @@
1. Some of the algorithms use random steps (such as the random sampling used by the `RandomForest`). I used `random_state = 43` for the Random Forest, the Decision Tree and the Gradient Boosting. This question is validated if the scores you got are close to:
```console
# Linear regression
TRAIN
r2 on the train set: 0.34823544284172625
MAE on the train set: 0.533092001261455
MSE on the train set: 0.5273648371379568
TEST
r2 on the test set: 0.3551785428138914
MAE on the test set: 0.5196420310323713
MSE on the test set: 0.49761195027083804
# SVM
TRAIN
r2 on the train set: 0.6462366150965996
MAE on the train set: 0.38356451633259875
MSE on the train set: 0.33464478671339165
TEST
r2 on the test set: 0.6162644671183826
MAE on the test set: 0.3897680598426786
MSE on the test set: 0.3477101776543003
# Decision Tree
TRAIN
r2 on the train set: 0.9999999999999488
MAE on the train set: 1.3685733933909677e-08
MSE on the train set: 6.842866883530944e-14
TEST
r2 on the test set: 0.6263651902480918
MAE on the test set: 0.4383758696244002
MSE on the test set: 0.4727017198871596
# Random Forest
TRAIN
r2 on the train set: 0.9705418471542886
MAE on the train set: 0.11983836612191189
MSE on the train set: 0.034538356420577995
TEST
r2 on the test set: 0.7504673649554309
MAE on the test set: 0.31889891600404635
MSE on the test set: 0.24096164834441108
# Gradient Boosting
TRAIN
r2 on the train set: 0.7395782392433273
MAE on the train set: 0.35656543036682264
MSE on the train set: 0.26167490389525294
TEST
r2 on the test set: 0.7157456298013534
MAE on the test set: 0.36455447680396397
MSE on the test set: 0.27058170064218096
```
It is important to notice that the Decision Tree overfits very easily. It learns the training data easily but is not able to extrapolate to the test set. This algorithm is not used much on its own.
However, Random Forest and Gradient Boosting offer a solid approach to correct the overfitting (in this case the parameter `max_depth` is set to None, which is why the Random Forest overfits the data). These two algorithms are used intensively in Machine Learning projects.

55
one_exercise_per_file/week02/day04/ex05/readme.md

@ -0,0 +1,55 @@
# Exercise 5 Machine Learning models
The goal of this exercise is to have an overview of the existing Machine Learning models and to learn to call them from scikit learn.
We will focus on:
- SVM/SVC
- Decision Tree
- Random Forest (Ensemble learning)
- Gradient Boosting (Ensemble learning, Boosting techniques)
All these algorithms exist in two versions: regression and classification. Even if the logic is similar in both classification and regression, the loss function is specific to each case.
It is really easy to get lost among all the existing algorithms. This article is very useful to have a clear overview of the models and to understand which algorithm to use and when. https://towardsdatascience.com/how-to-choose-the-right-machine-learning-algorithm-for-your-application-1e36c32400b9
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the models, which is why the code to fit the Linear Regression is given.*
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']

# split data train test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=43)

# pipeline
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)

# fit
pipe.fit(X_train, y_train)
```
1. Create 5 pipelines with 5 different models as final estimator (keep the imputer and scaler unchanged):
1. Linear Regression
2. SVM
3. Decision Tree (set `random_state=43`)
4. Random Forest (set `random_state=43`)
5. Gradient Boosting (set `random_state=43`)
Take time to get a basic understanding of the role of the main hyperparameters and their default values.
- For each algorithm, print the R2, MSE and MAE on both train set and test set.
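A minimal sketch of one possible way to build and evaluate the five pipelines, reusing the preliminary code above (the step names are arbitrary):
```python
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

models = [('lr', LinearRegression()),
          ('svm', SVR()),
          ('tree', DecisionTreeRegressor(random_state=43)),
          ('rf', RandomForestRegressor(random_state=43)),
          ('gb', GradientBoostingRegressor(random_state=43))]

for name, model in models:
    pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                     ('scaler', StandardScaler()),
                     (name, model)])
    pipe.fit(X_train, y_train)
    for set_name, X_set, y_true in [('train', X_train, y_train),
                                    ('test', X_test, y_test)]:
        y_pred = pipe.predict(X_set)
        print(f"# {name} ({set_name})")
        print("r2:", r2_score(y_true, y_pred))
        print("MAE:", mean_absolute_error(y_true, y_pred))
        print("MSE:", mean_squared_error(y_true, y_pred))
```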

31
one_exercise_per_file/week02/day04/ex06/audit/readme.md

@ -0,0 +1,31 @@
1. This question is validated if the code that runs the `gridsearch` is (the parameters may change):
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [10, 50, 75],
              'max_depth': [3, 5, 7],
              'min_samples_leaf': [10, 20, 30]}

rf = RandomForestRegressor()
# the first 18576 rows are the train set, the remaining 2064 rows the test set
gridsearch = GridSearchCV(rf,
                          parameters,
                          cv=[(np.arange(18576), np.arange(18576, 20640))],
                          n_jobs=-1)
gridsearch.fit(X, y)
```
2. This question is validated if the function is:
```python
def select_model_verbose(gs):
    return gs.best_estimator_, gs.best_params_, gs.best_score_
```
In my case, the grid search parameters are not interesting. Even though I reduced the overfitting of the Random Forest, the score on the test set is lower than the test score obtained with the Gradient Boosting in the previous exercise without any optimal parameter search.
3. This question is validated if the code used is:
```python
model, best_params, best_score = select_model_verbose(gridsearch)
model.predict(new_point)
```

51
one_exercise_per_file/week02/day04/ex06/readme.md

@ -0,0 +1,51 @@
# Exercise 6 Grid Search
The goal of this exercise is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameters, which are the parameters of the model, impact its performance.
The scikit-learn object that runs the Grid Search is called GridSearchCV. We will learn about cross validation tomorrow. For now, let us set the parameter **cv** to `[(np.arange(18576), np.arange(18576,20640))]`.
This means that GridSearchCV splits the data set into one train set and one test set.
Preliminary:
- Load the California Housing data set. As mentioned, this time there's no need to split the data set into a train set and a test set since GridSearchCV does it.
You will have to run a Grid Search on the Random Forest with at least the hyperparameters mentioned below. It doesn't mean these are the only hyperparameters of the model. If possible, try at least 3 different values for each hyperparameter.
1. Run a Grid Search with `n_jobs` set to `-1` to parallelize the computations on all CPUs. The hyperparameters to change are: `n_estimators`, `max_depth`, `min_samples_leaf`. It may take a few minutes to run.
Now, let us analyse the grid search's results in order to select the best model.
2. Write a function that takes the Grid Search object as input and returns the best model **fitted**, the best set of hyperparameters and the associated score:
```python
def select_model_verbose(gs):
    return trained_model, best_params, best_score
```
3. Use the trained model to predict on a new point:
```python
new_point = np.array([[3.2031, 52., 5.47761194, 1.07960199, 910., 2.26368159, 37.85, -122.26]])
```
How do we know the best model returned by GridSearchCV is good enough and stable? That is what we will learn tomorrow!
**WARNING: Some combinations of hyperparameters are not possible. For example, with the SVM, the linear kernel has no `gamma` parameter.**
**Note**:
- GridSearchCV can also take a Pipeline instead of a Machine Learning model. It is useful to combine some Imputers or Dimension reduction techniques with some Machine Learning models in the same Pipeline.
- It may be useful to check on Kaggle if some Kagglers share their Grid Searches.
Resources:
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://stackoverflow.com/questions/38555650/try-multiple-estimator-in-one-grid-search
- https://medium.com/fintechexplained/what-is-grid-search-c01fe886ef0a
- https://elutins.medium.com/grid-searching-in-machine-learning-quick-explanation-and-python-implementation-550552200596
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

34
one_exercise_per_file/week02/day04/readme.md

@ -0,0 +1,34 @@
# D04 Piscine AI - Data Science
# Table of Contents:
# Introduction
Today we will learn how to choose the right Machine Learning metric depending on the problem you are solving, and how to compute it. A metric gives an idea of how well the model performs. Depending on whether you are working on a classification problem or a regression problem, the metrics considered are different. It is important to understand that all metrics are just metrics, not the truth.
We will focus on the most important metrics:
- Regression:
- **R2**, **Mean Square Error**, **Mean Absolute Error**
- Classification:
- **F1 score**, **accuracy**, **precision**, **recall** and **AUC scores**. Even if it is not considered a metric, the **confusion matrix** is always useful to understand the model's performance.
Warning: **Imbalanced data set**
Let us assume we are predicting a rare event that occurs less than 2% of the time. Getting a good accuracy is easy: the model doesn't have to be "smart", all it has to do is always predict the majority class. Depending on the problem this can be disastrous. For example, with real-life data, breast cancer prediction is an imbalanced problem where predicting the majority class leads to disastrous consequences. That is why metrics such as AUC are useful.
- https://stats.stackexchange.com/questions/260164/auc-and-class-imbalance-in-training-test-dataset
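To make this concrete, a tiny sketch with hypothetical labels where only 2% of the samples belong to the positive class:
```python
import numpy as np
from sklearn.metrics import accuracy_score

# hypothetical imbalanced labels: 2 positive samples out of 100
y_true = np.array([1] * 2 + [0] * 98)
y_pred = np.zeros(100, dtype=int)  # a "dumb" model that always predicts the majority class

print(accuracy_score(y_true, y_pred))  # 0.98: looks great, yet no rare event is ever detected
```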
Before computing the metrics, read this article carefully to understand the role of these metrics.
- https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-models-2.html
+ ML models + GS
## Historical
## Rules
## Resources
- https://scikit-learn.org/stable/modules/model_evaluation.html

0
one_exercise_per_file/week02/day05/audit/readme.md

18
one_exercise_per_file/week02/day05/ex01/audit/readme.md

@ -0,0 +1,18 @@
1. This question is validated if the output of the 5-fold cross validation is:
```console
Fold: 1
TRAIN: [2 3 4 5 6 7 8 9] TEST: [0 1]
Fold: 2
TRAIN: [0 1 4 5 6 7 8 9] TEST: [2 3]
Fold: 3
TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5]
Fold: 4
TRAIN: [0 1 2 3 4 5 8 9] TEST: [6 7]
Fold: 5
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
```
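A minimal sketch that produces this output (with the default `shuffle=False`, `KFold` keeps the samples in order):
```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(np.arange(1, 21).reshape(10, -1))
y = np.array(np.arange(1, 11))

kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    print(f"Fold: {fold}")
    print("TRAIN:", train_index, "TEST:", test_index)
```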

27
one_exercise_per_file/week02/day05/ex01/readme.md

@ -0,0 +1,27 @@
# Exercise 1: K-Fold
The goal of this exercise is to learn to use `KFold` to split the data set for a k-fold cross validation. Most of the time you won't use this function to split your data yourself, because it is used under the hood by functions such as `cross_val_score`, `cross_validate` or `GridSearchCV`. But it helps to understand the splitting and to create a custom one if needed.
```python
X = np.array(np.arange(1,21).reshape(10,-1))
y = np.array(np.arange(1,11))
```
1. Using `KFold`, perform a 5-fold cross validation. For each fold, print the train index and test index. The expected output is:
```console
Fold: 1
TRAIN: [2 3 4 5 6 7 8 9] TEST: [0 1]
Fold: 2
TRAIN: [0 1 4 5 6 7 8 9] TEST: [2 3]
Fold: 3
TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5]
Fold: 4
TRAIN: [0 1 2 3 4 5 8 9] TEST: [6 7]
Fold: 5
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
```

16
one_exercise_per_file/week02/day05/ex02/audit/readme.md

@ -0,0 +1,16 @@
1. This question is validated if the output is:
```console
Scores on validation sets:
[0.62433594 0.61648956 0.62486602 0.59891024 0.59284295 0.61307055
0.54630341 0.60742976 0.60014575 0.59574508]
Mean of scores on validation sets:
0.60201392526743
Standard deviation of scores on validation sets:
0.0214983822773466
```
The model is consistent across folds: it is stable. That's a first sign that the model is not overfitted. The average R2 is 60%, which is a good start, but it can be improved.
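A minimal sketch that produces these statistics, assuming the unfitted pipeline `pipe` and the train set from the exercise statement (the default scoring for a regressor is R2):
```python
from sklearn.model_selection import cross_validate

cv_results = cross_validate(pipe, X_train, y_train, cv=10)
scores = cv_results['test_score']  # here "test" refers to the validation set of each fold

print("Scores on validation sets:\n", scores)
print("Mean of scores on validation sets:\n", scores.mean())
print("Standard deviation of scores on validation sets:\n", scores.std())
```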

53
one_exercise_per_file/week02/day05/ex02/readme.md

@ -0,0 +1,53 @@
# Exercise 2: Cross validation (k-fold)
The goal of this exercise is to learn how to use cross validation. After reading the articles you should be able to explain why we need to cross-validate the models. We will first focus on Linear Regression to reduce the computation time. We will use `cross_validate` to run the cross validation. Note that `cross_val_score` is similar, but `cross_validate` calculates one or more scores and timings for each CV split.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the cross validation, which is why the code to fit the Linear Regression is given.*
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']

# split data train test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=43)

# pipeline
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)
```
1. Cross validate the Pipeline using `cross_validate` with 10 folds. Print the scores on each validation set, the mean of the scores and the standard deviation of the scores on the validation sets. The expected output is:
```console
Scores on validation sets:
[0.62433594 0.61648956 0.62486602 0.59891024 0.59284295 0.61307055
0.54630341 0.60742976 0.60014575 0.59574508]
Mean of scores on validation sets:
0.60201392526743
Standard deviation of scores on validation sets:
0.0214983822773466
```
**Note: It may be confusing that the key of the dictionary that returns the results on the validation sets is `test_score`. Sometimes, the validation sets are called test sets. Here, we run the cross validation on X_train, which means the scores are computed on subsets of the initial train set. The X_test is not used for the cross validation.**
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
- https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/

33
one_exercise_per_file/week02/day05/ex03/audit/readme.md

@ -0,0 +1,33 @@
1. This question is validated if the code that runs the grid search is similar to:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [10, 50, 75],
              'max_depth': [4, 7, 10]}

rf = RandomForestRegressor()
gridsearch = GridSearchCV(rf,
                          parameters,
                          cv=5,
                          n_jobs=-1,
                          scoring='neg_mean_squared_error')
gridsearch.fit(X_train, y_train)
```
Answers that use another list of parameters are accepted too!
2. This question is validated if you called these attributes:
```python
print(gridsearch.best_score_)
print(gridsearch.best_params_)
print(gridsearch.cv_results_)
```
The best score is -0.29028202683007526, which means the MSE is ~0.29. On its own this doesn't give much information since the metric is arbitrary; this score is the average of `neg_mean_squared_error` over all the validation sets.
The best model's params are `{'max_depth': 10, 'n_estimators': 75}`.
As you may have a different parameter list than this one, you may get different results.
3. This question is validated if you used the fitted estimator to compute the score on the test set: `gridsearch.score(X_test, y_test)`. The MSE score is ~0.27. The score I got on the test set is close to the score I got on the validation sets. It means the model is not overfitted.

49
one_exercise_per_file/week02/day05/ex03/readme.md

@ -0,0 +1,49 @@
# Exercise 3 GridsearchCV
The goal of this exercise is to learn to use GridSearchCV to run a grid search, predict on the test set and score on the test set.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the grid search, which is why the code to fit the Linear Regression is given.*
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']

# split data train test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=43)

# pipeline
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)
```
1. Run `GridSearchCV` on all CPUs with 5 folds, MSE as score, Random Forest as model with:
- max_depth between 1 and 20 (at least 3 values)
- n_estimators between 1 and 100 (at least 3 values)
This may take a few minutes to run.
*Hint*: The name of the metric to put in the parameter `scoring` is `neg_mean_squared_error`. The smaller the MSE, the better the model; on the contrary, the greater the R2, the better the model. `GridSearchCV` chooses the best model by selecting the one that maximizes the score on the validation sets. And, in mathematics, maximizing a function is equivalent to minimizing its opposite. More details:
- https://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error
2. Extract the best fitted estimator, print its params, print its score on the validation set and print `cv_results_`.
3. Compute the score on the test set.
**WARNING: If the score used in classification is the AUC, there is one rare case where the AUC may return an error or a warning: a fold contains only one class. In that case the AUC can't be computed, by definition.**

27
one_exercise_per_file/week02/day05/ex04/audit/readme.md

@ -0,0 +1,27 @@
1. This question is validated if the outputted plot looks like:
![alt text][logo_ex5q1]
[logo_ex5q1]: ../images/w2_day5_ex5_q1.png "Validation curve "
The code that generated the data in the plot is:
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

clf = RandomForestClassifier()
param_range = np.arange(1, 30, 2)
train_scores, test_scores = validation_curve(clf,
                                             X,
                                             y,
                                             param_name="max_depth",
                                             param_range=param_range,
                                             scoring="roc_auc",
                                             n_jobs=-1)
```
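To go from these scores to the plot, a minimal matplotlib sketch (averaging over the folds):
```python
import matplotlib.pyplot as plt

# validation_curve returns one row per value of max_depth and
# one column per fold: average over the folds.
plt.plot(param_range, train_scores.mean(axis=1), label='Training score')
plt.plot(param_range, test_scores.mean(axis=1), label='Cross-validation score')
plt.xlabel('max_depth')
plt.ylabel('ROC AUC')
plt.title('Validation curve')
plt.legend()
plt.show()
```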
2. This question is validated if the output looks like:
![alt text][logo_ex5q2]
[logo_ex5q2]: ../images/w2_day5_ex5_q2.png "Learning curve "

BIN
one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q1.png
(binary image added, 66 KiB)

BIN
one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q2.png
(binary image added, 176 KiB)

58
one_exercise_per_file/week02/day05/ex04/readme.md

@ -0,0 +1,58 @@
# Exercise 4 Validation curve and Learning curve
The goal of this exercise is to learn to analyse the model's performance with two tools:
- Validation curve
- Learning curve
For this exercise we will use a dataset of 100k data points to give you an idea of the computation time you can expect during projects.
Preliminary:
- Using make_classification from sklearn, generate a binary data set with 100k data points and with 30 features.
```python
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000,
                           n_features=30,
                           n_informative=10,
                           flip_y=0.2)
```
1. Plot the validation curve, using all CPUs, with 5 folds. The goal is to focus again on max_depth, between 1 and 20.
You may need to increase the window (example: between 1 and 50) if you notice that other values of max_depth could have returned better results. This may take a few minutes.
I do not expect you to implement the whole plot from scratch; you'd better leverage the code here:
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve
The plot should look like this:
![alt text][logo_ex5q1]
[logo_ex5q1]: images/w2_day5_ex5_q1.png "Validation curve "
The interpretation is that from max_depth=10, the train score keeps increasing but the test score (or validation score) reaches a plateau. It means that choosing max_depth = 20 may lead to an overfitted model.
Note: Given the computation time, it is not possible to plot the validation curve for all parameters. It is useful to plot it for the parameters that control overfitting the most.
More details:
- https://chrisalbon.com/machine_learning/model_evaluation/plot_the_validation_curve/
2. Let us assume the gridsearch returned `clf = RandomForestClassifier(max_depth=12)`. Let's check if the model underfits, overfits or fits correctly. Plot the learning curve. These two resources will help you a lot to understand how to analyse learning curves and how to plot them:
- https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py
- **Re-use the function in the second resource**, change the cross validation to a classic 10-folds, run the learning curve data computation on all CPUs and plot the three plots as shown below.
![alt text][logo_ex5q2]
[logo_ex5q2]: images/w2_day5_ex5_q2.png "Learning curve "
- **Note Plot Learning Curves**: The learning curve is detailed in the first resource.
- **Note Plot Scalability of the model**: This plot shows the relationship between the time to train the model and the number of rows in the data. In that case, the relationship is linear.
- **Note Performance of the model**: This plot shows whether it is worth increasing the training time by adding data to improve the score. It would be worth adding data if the curve hadn't reached a plateau yet. In that case, increasing the training time by 10 units increases the score by less than 0.001.
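A minimal sketch of the data computation behind the learning curve, assuming `X` and `y` from the preliminary (the plotting itself is detailed in the resources above):
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

clf = RandomForestClassifier(max_depth=12)
# Classic 10-fold cross validation, run on all CPUs
train_sizes, train_scores, test_scores = learning_curve(
    clf,
    X,
    y,
    cv=10,
    n_jobs=-1,
    train_sizes=np.linspace(0.1, 1.0, 5))
```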

26
one_exercise_per_file/week02/day05/readme.md

@ -0,0 +1,26 @@
# D05 Piscine AI - Data Science
# Table of Contents:
# Introduction
If you finished yesterday's exercises you should be able to train several Machine Learning algorithms and to choose the one returned by GridSearchCV.
GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the **cv** parameter to compute the GridSearch with a train set and a test set.
It means that the selected model is based on one single measure. What if, by luck, the model predicts correctly on that particular split? What if the best model is actually bad? What if I could have selected a better model?
We will answer these questions today! The topics we will cover are among the most important in Machine Learning.
Must read before starting the exercises:
- Bias-Variance trade-off, aka Underfitting/Overfitting:
- https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
- https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html
- Cross-validation
- https://algotrading101.com/learn/train-test-split/
## Rules
## Resources

111
one_exercise_per_file/week02/raid02/audit/readme.md

@ -0,0 +1,111 @@
# Forest Cover Type Prediction - Correction
The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features and train a machine learning model on the cartographic data to make it as accurate as possible.
## Problem
The expected structure of the project is:
```
project
│ README.md
│ environment.yml
└───data
│ │ train.csv
│ | test.csv (not available first day)
| | covtype.info
└───notebook
│ │ EDA.ipynb
|
|───scripts
| │ preprocessing_feature_engineering.py
| │ model_selection.py
│ | predict.py
└───results
│ confusion_matrix_heatmap.png
│ learning_curve_best_model.png
│ test_predictions.csv
│ best_model.pkl
```
- The readme file contains a description of the project and explains how to run the code from an empty environment. It also gives a summary of the implementation of each python file. The preprocessing, which is a key part, should be described precisely. Finally, it should contain a conclusion that gives the performance of the strategy.
- The environment has to contain all libraries used and their versions that are necessary to run the code.
- The notebook is not evaluated.
## 1. Preprocessing and feature engineering
## 2. Model selection and predict
### Data splitting
The data splitting structure is:
```
DATA
└───TRAIN FILE (0)
│ └───── Train (1):
│ | Fold0:
| | Train
| | Validation
| | Fold1:
| | Train
| | Validation
... ... ...
| |
| └───── Test (1)
└─── TEST FILE (0)(available last day)
```
- The train set (0) is divided into a train set (1) and a test set (1). The test ratio is less than 33%.
- The cross validation splits the train set (1) in at least 5 folds. If the cross validation is stratified, that's a good point, but it is not a requirement.
### Gridsearch
- It contains at least these 5 different models:
- Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression.
There are many options (a sketch of the pipeline option is given after this list):
- 5 grid searches on 1 model
- 1 grid search on 5 models
- 1 grid search on a pipeline that contains the preprocessing
- 5 grid searches on a pipeline that contains the preprocessing
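For instance, a sketch of the "grid search on a pipeline" option (parameter values are illustrative):
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scaler', StandardScaler()),
                 ('rf', RandomForestClassifier())])

# Parameters of a pipeline step use the `<step>__<parameter>` syntax
parameters = {'rf__n_estimators': [50, 100],
              'rf__max_depth': [10, 20]}

gridsearch = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1)
gridsearch.fit(X_train, y_train)
```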
### Training
- Check that the **target is removed from the X** matrix
### Results
Run predict.py on the test set, check that:
- Test (last day) accuracy > **0.65**.
Then, check:
- Train accuracy score < **0.98**. It can be checked on the learning curve. If you are not sure, load the model, load the training set (0), and score on the training set (0).
- The confusion matrix is represented as a DataFrame. Example:
![alt text][confusion_matrix]
[confusion_matrix]: ../images/w2_weekend_confusion_matrix.png "Confusion matrix "
- The learning curve for the best model is plotted. Example:
![alt text][logo_learning_curve]
[logo_learning_curve]: ../images/w2_weekend_learning_curve.png "Learning curve "
Note: The green line on the plot shows the accuracy on the validation set, not on the test set (1) nor on the test set (0).
- The trained model is saved as a pickle file

BIN
one_exercise_per_file/week02/raid02/images/w2_weekend_confusion_matrix.png
(binary image added, 53 KiB)

BIN
one_exercise_per_file/week02/raid02/images/w2_weekend_learning_curve.png
(binary image added, 79 KiB)

101
one_exercise_per_file/week02/raid02/readme.md

@ -0,0 +1,101 @@
# Forest Cover Type Prediction
The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features and train a machine learning model on the cartographic data to make it as accurate as possible.
## Data
The input files are `train.csv`, `test.csv` and `covtype.data`:
- `train.csv`
- `test.csv`
- `covtype.info`
The train data set is used to **analyse the data and calibrate the models**. The goal is to get the accuracy as high as possible on the test set. The test set will only be made available at the end of the last day, to prevent overfitting of the test set.
The data is described in `covtype.info`.
## Structure
The structure of the project is:
```console
project
│ README.md
│ environment.yml
└───data
│ │ train.csv
│ | test.csv (not available first day)
| | covtype.info
└───notebook
│ │ EDA.ipynb
|
|───scripts
| │ preprocessing_feature_engineering.py
| │ model_selection.py
│ | predict.py
└───results
│ plots
│ test_predictions.csv
│ best_model.pkl
```
## 1. EDA and feature engineering
- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook will not be evaluated.
- *Hint: Examples of interesting features (a sketch of how to compute them follows this list)*
- `Distance to hydrology = sqrt((Horizontal_Distance_To_Hydrology)^2 + (Vertical_Distance_To_Hydrology)^2)`
- `Horizontal_Distance_To_Fire_Points - Horizontal_Distance_To_Roadways`
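A sketch of how these features could be computed with pandas (the input column names are those described in `covtype.info`; the new column names are up to you):
```python
import numpy as np
import pandas as pd

df = pd.read_csv('data/train.csv')

# Euclidean distance to the nearest surface water feature
df['Distance_To_Hydrology'] = np.sqrt(
    df['Horizontal_Distance_To_Hydrology'] ** 2
    + df['Vertical_Distance_To_Hydrology'] ** 2)

# Difference between the distances to fire points and to roadways
df['Fire_Minus_Road'] = (df['Horizontal_Distance_To_Fire_Points']
                         - df['Horizontal_Distance_To_Roadways'])
```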
## 2. Model Selection
The model selection approach is a key step because it should return the best model and guarantee that the results are reproducible on the final test set. The goal of this step is to make sure that the results on the test set are not due to test set overfitting. It implies splitting the data set as shown below:
```console
DATA
└───TRAIN FILE (0)
│ └───── Train (1)
│ | Fold0:
| | Train
| | Validation
| | Fold1:
| | Train
| | Validation
... ... ...
| |
| └───── Test (1)
└─── TEST FILE (0) (available last day)
```
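A minimal sketch of this splitting scheme (the test ratio and random state are illustrative):
```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# TRAIN FILE (0) -> Train (1) + Test (1)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=43)

# The cross validation runs on Train (1) only; Test (1) is kept
# aside until the best model has been selected.
cv = StratifiedKFold(n_splits=5)
```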
**Rules:**
- Split train test
- Cross validation: at least 5 folds
- Grid search on at least 5 different models:
- Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression. *Remember that for some models, scaling the data is important, and for others it doesn't matter.*
- Train accuracy score < **0.98**. Train set (0). Write the result in the `README.md`
- Test (last day) accuracy > **0.65**. Test set (0). Write the result in the `README.md`
- Display the confusion matrix for the best model in a DataFrame. Specify the index and column names (True label and Predicted label); see the sketch after the hint below
- Plot the learning curve for the best model
- Save the trained model as a [pickle](https://www.datacamp.com/community/tutorials/pickle-python-tutorial) file
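A sketch of the last rule, saving the trained model (the path follows the structure above; `gridsearch` is assumed to be fitted):
```python
import pickle

# Persist the best fitted estimator returned by the grid search
with open('results/best_model.pkl', 'wb') as f:
    pickle.dump(gridsearch.best_estimator_, f)
```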
> Advice: As the grid search takes time, I suggest preparing and testing the code first. Once you are confident it works, run the grid search overnight and analyse the results
**Hint**: The confusion matrix shows the misclassifications class per class. Try to detect whether the model badly misclassifies one class as another. Then, do some research on the two forest cover types, find the differences and create some new features that underline these differences. More generally, the methodology of a model's learning is a cycle with several iterations. More details [here](https://serokell.io/blog/machine-learning-testing)
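A sketch of the confusion matrix displayed as a labelled DataFrame (`y_test` and `y_pred` are assumed to come from the selected model):
```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = sorted(set(y_test))
cm = confusion_matrix(y_test, y_pred, labels=labels)
cm_df = pd.DataFrame(cm,
                     index=[f'True {l}' for l in labels],
                     columns=[f'Predicted {l}' for l in labels])
print(cm_df)
```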
## 3. Predict (last day)
Once you have selected the best model and you are confident it will perform well on new data, you're ready to predict on the test set:
- Load the trained model
- Predict on the test set and compute the accuracy
- Save the predictions in a csv file
- Add your score on the `README.md`
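A minimal sketch of what `predict.py` could do (paths follow the project structure above; the target column name is an assumption based on the original data set):
```python
import pickle

import pandas as pd
from sklearn.metrics import accuracy_score

# Load the trained model
with open('results/best_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Predict on the test set and compute the accuracy
test = pd.read_csv('data/test.csv')
X_test = test.drop(columns=['Cover_Type'])  # assumption: target column name
y_pred = model.predict(X_test)
print('Test accuracy:', accuracy_score(test['Cover_Type'], y_pred))

# Save the predictions
pd.DataFrame({'prediction': y_pred}).to_csv('results/test_predictions.csv',
                                            index=False)
```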