diff --git a/one_exercise_per_file/week01/raid01/audit/readme.md b/one_exercise_per_file/week01/raid01/audit/readme.md index 62768f1..697f357 100644 --- a/one_exercise_per_file/week01/raid01/audit/readme.md +++ b/one_exercise_per_file/week01/raid01/audit/readme.md @@ -1,3 +1,4 @@ +# RAID01 - Backtesting on the SP500 - correction ``` project diff --git a/one_exercise_per_file/week01/raid01/readme.md b/one_exercise_per_file/week01/raid01/readme.md index 4155945..0d6bad7 100644 --- a/one_exercise_per_file/week01/raid01/readme.md +++ b/one_exercise_per_file/week01/raid01/readme.md @@ -1,44 +1,41 @@ -# D0607 Piscine AI - Data Science +# RAID01 - Backtesting on the SP500 -## SP data preprocessing - -The goal of this project is to perform a Backtest on the SP500 constituents. The SP500 is an index the 500 biggest capitalization in the US. +## SP500 data preprocessing +The goal of this project is to perform a Backtest on the SP500 constituents. The SP500 is an index of the 500 biggest capitalizations in the US. ## Data The input file are `stock_prices.csv` and : - - `sp500.csv` contains the SP500 data. The SP500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States. - - `stock_prices.csv`: contains the close prices for all the companies that had been in the SP500. It contains a lot of missing data. The adjusted close price may be unavailable for three main reasons: - -- The company doesn't exist at date d -- The company is not public, pas coté -- Its close price hasn't been reported -- Note: The quality of this data set is not good: some prices are wrong, there are some prices spikes, there are some prices adjusments (share split, dividend distribution) - the prices adjusment are corrected in the adjusted close. But I'm not providing this data for this project to let you understand what is bad quality data and how important it is to detect outliers and missing values. The idea is not to correct the full the data set manually but to correct the main problems. -*Note: The corrections won't fix the data, as a result the results may be abnormal compared to results from cleaned financial data. That's not a problem for this small project ! * +- `sp500.csv` contains the SP500 data. The SP500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States. -## Problem +- `stock_prices.csv`: contains the close prices for all the companies that had been in the SP500. It contains a lot of missing data. The adjusted close price may be unavailable for three main reasons: -Once, preprocessed, this data will be used to generate a signal that is, for each asset at each date a metric that indicates if the asset price will increase the next month. At each date (once a month) we will take the 20 highest metrics and invest 1$ per company. This strategy is called stock picking. It consists in picking stock in an index and try to overperfom the index. Finally we will compare the performance of our strategy compared to the benchmark: the SP500. + - The company doesn't exist at date d + - The company is not public (not listed) + - Its close price hasn't been reported + - Note: The quality of this data set is not good: some prices are wrong, there are some price spikes, there are some price adjustments (share split, dividend distribution) - the price adjustments are corrected in the adjusted close. But I'm not providing this data for this project, to let you understand what bad quality data is and how important it is to detect outliers and missing values. The idea is not to correct the full data set manually, but to correct the main problems.
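As a first sanity check, you can load both files and count the missing values per company. The paths and the column layout below are assumptions; adapt them to the actual CSVs.

```python
import pandas as pd

# Assumed locations and layout of the input files
prices = pd.read_csv("data/stock_prices.csv", index_col=0, parse_dates=True)
sp500 = pd.read_csv("data/sp500.csv", index_col=0, parse_dates=True)

# How much data is missing for each company?
print(prices.isna().sum().sort_values(ascending=False).head(10))
```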
+_Note: The corrections will not fix the data; as a result, the results may be abnormal compared to results from cleaned financial data. That's not a problem for this small project!_ - -It is important to understand that the SP500 components change over time. The reason is simple: Facebook entered the SP500 in ???? and as there are 500 companies +## Problem +Once preprocessed, this data will be used to generate a signal, that is, for each asset at each date, a metric that indicates if the asset price will increase the next month. At each date (once a month) we will take the 20 highest metrics and invest 1$ per company. This strategy is called **stock picking**. It consists in picking stocks in an index and trying to outperform the index. Finally we will compare the performance of our strategy to the benchmark: the SP500. +It is important to understand that the SP500 components change over time. The reason is simple: Facebook entered the SP500 in 2013, which means that another company had to be removed from the 500 companies. -The structure of the project is: +The structure of the project is: -``` +```console project │ README.md -│ environment.yml +│ environment.yml │ └───data │ │ sp500.csv │ | prices.csv -│ +│ └───notebook │ │ analysis.ipynb | @@ -48,104 +45,99 @@ project | │ create_signal.py | | backtester.py │ | main.py -│ +│ └───results │ plots │ results.txt │ outliers.txt +``` -``` - -There are four parts: +There are four parts: ## 1. Preliminary -- Create a function that takes as input one CSV data file, optimizes the types to reduce its size and returns a memory optimized DataFrame. -- For float data the smaller data type used is `np.float32` -- These steps may help you to implement the memory_reducer: +- Create a function that takes as input one CSV data file. This function should optimize the types to reduce its size and return a memory-optimized DataFrame. +- For `float` data the smallest data type used is `np.float32` +- These steps may help you to implement the memory_reducer (see the sketch below): + +1. Iterate over every column +2. Determine if the column is numeric +3. Determine if the column can be represented by an integer +4. Find the min and the max value +5. Determine and apply the smallest datatype that can fit the range of values - 1. Iterate over every column - 2. Determine if the column is numeric - 3. Determine if the column can be represented by an integer - 4. Find the min and the max value - 5. Determine and apply the smallest datatype that can fit the range of values
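A minimal sketch of what `memory_reducer` could look like, following the steps above. The exact signature and the set of dtypes you try are up to you; this version takes a single CSV path, as an assumption.

```python
import numpy as np
import pandas as pd


def memory_reducer(path):
    """Read one CSV file and downcast every numeric column to the smallest dtype that fits."""
    df = pd.read_csv(path)
    for col in df.columns:                                # 1. iterate over every column
        if pd.api.types.is_numeric_dtype(df[col]):        # 2. keep only numeric columns
            c_min, c_max = df[col].min(), df[col].max()   # 4. find the min and the max value
            if pd.api.types.is_integer_dtype(df[col]):    # 3. the column can stay an integer
                for dtype in (np.int8, np.int16, np.int32, np.int64):
                    if np.iinfo(dtype).min <= c_min and c_max <= np.iinfo(dtype).max:
                        df[col] = df[col].astype(dtype)   # 5. smallest dtype that fits the range
                        break
            else:
                df[col] = df[col].astype(np.float32)      # floats: np.float32 is the smallest used
    return df
```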
+## 2. Data wrangling and preprocessing +- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least: + - Missing values analysis + - Outliers analysis (there are a lot of outliers) + - One plot of the average price for companies for all variables (save the plot with the images). + - Describe at least 5 outliers ('ticker', 'date', 'price'). Put them in the `outliers.txt` file with the 3 fields in the folder `results`. -## 2. Data wrangling and preprocessing: _Note: create functions that generate the plots and save them in the images folder. Add a parameter `plot` with a default value `False` which doesn't return the plot. This will be useful for the correction to let people run your code without overriding your plots._ - - Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least: - - Missing values analysis - - Outliers analysis (there are a lot of outliers) - - One of average price for companies for all variables (save the plot with the images). - - Describe at least 5 outliers ('ticker', 'date', 'price'). Put them in `outliers.txt` file with the 3 fields on the folder `results`. +- Here is how the `prices` data should be preprocessed: -*Note: create functions that generate the plots and save them in the images folder. Add a parameter `plot` with a defaut value `False` which doesn't return the plot. This will be useful for the correction to let people run your code without overriding your plots.* + - Resample data on month and keep the last value + - Filter price outliers: Remove prices outside of the range 0.1$, 10k$ + - Compute monthly returns: + - Historical returns. **returns(current month) = (price(current month) - price(previous month)) / price(previous month)** + - Future returns. **returns(current month) = (price(next month) - price(current month)) / price(current month)** - - Here is how the `prices` data should be preprocessed: - - Resample data on month and keep the last value - - Filter prices outliers: Remove prices outside of the range 0.1$, 10k$ - - Compute montly returns: - - Historical returns. **returns(current month) = price(current month) - price(previous month) / price(previous month)** - - Futur returns. **returns(current month) = price(next month) - price(current month) / price(current month)** - - Replace returns outliers by the last value available regarding the company. This corrects prices spikes that corresponds to a monthly return greater than 1 and smaller than -0.5. This correction shouldn't consider the 2008 and 2009 period as the financial crisis impacted the market brutally. **Don't forget that a value is considered as an outlier comparing to the other returns/prices of the same company** + - Replace returns outliers by the last value available regarding the company. This corrects price spikes that correspond to a monthly return greater than 1 or smaller than -0.5. This correction should not consider the 2008 and 2009 period as the financial crisis impacted the market brutally. **Don't forget that a value is considered an outlier compared to the other returns/prices of the same company**
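One possible way to implement these steps with pandas, assuming a wide DataFrame `prices` indexed by date with one column per ticker (an assumed layout, adapt it to your own):

```python
import pandas as pd


def preprocess_prices(prices: pd.DataFrame):
    """Monthly resampling, price filtering and monthly returns (sketch)."""
    # Resample on month and keep the last value
    monthly = prices.resample("M").last()

    # Filter price outliers: keep only prices inside the 0.1$ - 10k$ range
    monthly = monthly.where((monthly > 0.1) & (monthly < 10_000))

    # Historical return: (price(current month) - price(previous month)) / price(previous month)
    monthly_past_return = monthly.pct_change()

    # Future return: (price(next month) - price(current month)) / price(current month)
    monthly_future_return = monthly.pct_change().shift(-1)

    return monthly, monthly_past_return, monthly_future_return
```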
-At this stage the DataFrame should looks like this: +At this stage the DataFrame should look like this: -| | Price | monthly_past_return | monthly_futur_return | -|:-----------------------------------------------------|---------:|----------------------:|-----------------------:| -| (Timestamp('2000-12-31 00:00:00', freq='M'), 'A') | 36.7304 | nan | -0.00365297 | -| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AA') | 25.9505 | nan | 0.101194 | -| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AAPL') | 1.00646 | nan | 0.452957 | -| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABC') | 11.4383 | nan | -0.0528713 | -| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABT') | 38.7945 | nan | -0.07205 | -1 +| | Price | monthly_past_return | monthly_future_return | +| :--------------------------------------------------- | ------: | ------------------: | -------------------: | +| (Timestamp('2000-12-31 00:00:00', freq='M'), 'A') | 36.7304 | nan | -0.00365297 | +| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AA') | 25.9505 | nan | 0.101194 | +| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AAPL') | 1.00646 | nan | 0.452957 | +| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABC') | 11.4383 | nan | -0.0528713 | +| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABT') | 38.7945 | nan | -0.07205 | - Fill the missing values using the last available value (same company) - Drop the missing values that can't be filled - Print `prices.isna().sum()` - - Here is how the `sp500.csv` data should be preprocessed: - - Resample data on month and keep the last value - - Compute historical monthly returns on the adjusted close - + - Resample data on month and keep the last value + - Compute historical monthly returns on the adjusted close -## 3. Create signal +## 3. Create signal -At this stage we have a data set with features that we will leverage to get an investment signal. As previously said, we will focus on one single variable to create the signal: **monthly_past_return**. The signal will be the average of monthy returns of the previous year +At this stage we have a data set with features that we will leverage to get an investment signal. As previously said, we will focus on one single variable to create the signal: **monthly_past_return**. The signal will be the average of monthly returns of the previous year. -The naive assumption made here is that if a stock has performed well the last year it will perform well the next month. Moreover, we assume that we can buy stocks as soon as we have the signal (the signal is available at the close of day d and we assume that we can buy the stock at close of day d. The assumption is acceptable while considering monthly returns because the difference between the close of day d and the open of day d+1 is small comparing to the monthly return) +The naive assumption made here is that if a stock has performed well the last year it will perform well the next month. Moreover, we assume that we can buy stocks as soon as we have the signal (the signal is available at the close of day `d` and we assume that we can buy the stock at close of day `d`. The assumption is acceptable while considering monthly returns, because the difference between the close of day `d` and the open of day `d+1` is small compared to the monthly return) - Create a column `average_return_1y` -- Create a column named `signal` that contains True if `average_return_1y` is among the 20 highest in the month `average_return_1y`. - +- Create a column named `signal` that contains `True` if `average_return_1y` is among the 20 highest values of `average_return_1y` for that month.
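A possible sketch of the signal computation, assuming `prices` has a two-level (date, ticker) index whose levels are named `date` and `ticker` (name them if they are not) and the `monthly_past_return` column shown above:

```python
def create_signal(prices):
    """Add the average_return_1y column and the boolean top-20 signal (sketch)."""
    # Average of the monthly returns over the previous 12 months, company by company
    prices["average_return_1y"] = (
        prices["monthly_past_return"]
        .groupby(level="ticker")
        .transform(lambda s: s.rolling(12).mean())
    )

    # True if average_return_1y is among the 20 highest values of that month
    prices["signal"] = (
        prices["average_return_1y"]
        .groupby(level="date")
        .rank(ascending=False, method="first")
        .le(20)
    )
    return prices
```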
## 4. Backtester -At this stage we have an investment signal that indicates each month what are the 20 companies we should invest 1$ on (1$ each). In order to check the strategies' performance we will backtest our investment signal. +At this stage we have an investment signal that indicates each month which 20 companies we should invest 1$ in (1$ each). In order to check the strategy's performance we will backtest our investment signal. - Compute the PnL and the total return of our strategy without a for loop. Save the results in a text file `results.txt` in the folder `results`. - Compute the PnL and the total return of the strategy that consists in investing 20$ each day on the SP500. Compare. Save the results in a text file `results.txt` in the folder `results`. -- Create a plot that shows the performance of the strategy over time for the SP500 and the Stock Picking 20 strategy. -A data point (x-axis: date, y-axis: cumulated_return) is: the **cumulated returns** from the beginning of the strategy at date t. Save the plot in the results folder. +- Create a plot that shows the performance of the strategy over time for the SP500 and the Stock Picking 20 strategy. - This plot is used a lot in Finance because it helps to compare a custom strategy with in index. In that case we say that the SP500 is used as **benchmark** for the Stock Picking Strategy. +A data point (x-axis: date, y-axis: cumulated_return) is: the **cumulated returns** from the beginning of the strategy at date `t`. Save the plot in the results folder. +> This plot is used a lot in Finance because it helps to compare a custom strategy with an index. In that case we say that the SP500 is used as the **benchmark** for the Stock Picking Strategy. ![alt text][performance] -[performance]: images/w1_weekend_plot_pnl.png "Cumulative Performance" +[performance]: images/w1_weekend_plot_pnl.png 'Cumulative Performance' ## 5. Main -Here is a sketch of `main.py`. - -``` -main.py +Here is a sketch of `main.py`.
+```python +# main.py # import data prices, sp500 = memory_reducer(paths) @@ -157,8 +149,7 @@ prices, sp500 = preprocessing(prices, sp500) prices = create_signal(prices) #backtest - backtest(prices, sp500) ``` -**The command `python main.py` executes the code from data imports to the backtest and save the results.** \ No newline at end of file +**The command `python main.py` executes the code from data imports to the backtest and save the results.** \ No newline at end of file diff --git a/one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.data b/one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.data new file mode 100644 index 0000000..a50b76f --- /dev/null +++ b/one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.data @@ -0,0 +1,699 @@ +1000025,5,1,1,1,2,1,3,1,1,2 +1002945,5,4,4,5,7,10,3,2,1,2 +1015425,3,1,1,1,2,2,3,1,1,2 +1016277,6,8,8,1,3,4,3,7,1,2 +1017023,4,1,1,3,2,1,3,1,1,2 +1017122,8,10,10,8,7,10,9,7,1,4 +1018099,1,1,1,1,2,10,3,1,1,2 +1018561,2,1,2,1,2,1,3,1,1,2 +1033078,2,1,1,1,2,1,1,1,5,2 +1033078,4,2,1,1,2,1,2,1,1,2 +1035283,1,1,1,1,1,1,3,1,1,2 +1036172,2,1,1,1,2,1,2,1,1,2 +1041801,5,3,3,3,2,3,4,4,1,4 +1043999,1,1,1,1,2,3,3,1,1,2 +1044572,8,7,5,10,7,9,5,5,4,4 +1047630,7,4,6,4,6,1,4,3,1,4 +1048672,4,1,1,1,2,1,2,1,1,2 +1049815,4,1,1,1,2,1,3,1,1,2 +1050670,10,7,7,6,4,10,4,1,2,4 +1050718,6,1,1,1,2,1,3,1,1,2 +1054590,7,3,2,10,5,10,5,4,4,4 +1054593,10,5,5,3,6,7,7,10,1,4 +1056784,3,1,1,1,2,1,2,1,1,2 +1057013,8,4,5,1,2,?,7,3,1,4 +1059552,1,1,1,1,2,1,3,1,1,2 +1065726,5,2,3,4,2,7,3,6,1,4 +1066373,3,2,1,1,1,1,2,1,1,2 +1066979,5,1,1,1,2,1,2,1,1,2 +1067444,2,1,1,1,2,1,2,1,1,2 +1070935,1,1,3,1,2,1,1,1,1,2 +1070935,3,1,1,1,1,1,2,1,1,2 +1071760,2,1,1,1,2,1,3,1,1,2 +1072179,10,7,7,3,8,5,7,4,3,4 +1074610,2,1,1,2,2,1,3,1,1,2 +1075123,3,1,2,1,2,1,2,1,1,2 +1079304,2,1,1,1,2,1,2,1,1,2 +1080185,10,10,10,8,6,1,8,9,1,4 +1081791,6,2,1,1,1,1,7,1,1,2 +1084584,5,4,4,9,2,10,5,6,1,4 +1091262,2,5,3,3,6,7,7,5,1,4 +1096800,6,6,6,9,6,?,7,8,1,2 +1099510,10,4,3,1,3,3,6,5,2,4 +1100524,6,10,10,2,8,10,7,3,3,4 +1102573,5,6,5,6,10,1,3,1,1,4 +1103608,10,10,10,4,8,1,8,10,1,4 +1103722,1,1,1,1,2,1,2,1,2,2 +1105257,3,7,7,4,4,9,4,8,1,4 +1105524,1,1,1,1,2,1,2,1,1,2 +1106095,4,1,1,3,2,1,3,1,1,2 +1106829,7,8,7,2,4,8,3,8,2,4 +1108370,9,5,8,1,2,3,2,1,5,4 +1108449,5,3,3,4,2,4,3,4,1,4 +1110102,10,3,6,2,3,5,4,10,2,4 +1110503,5,5,5,8,10,8,7,3,7,4 +1110524,10,5,5,6,8,8,7,1,1,4 +1111249,10,6,6,3,4,5,3,6,1,4 +1112209,8,10,10,1,3,6,3,9,1,4 +1113038,8,2,4,1,5,1,5,4,4,4 +1113483,5,2,3,1,6,10,5,1,1,4 +1113906,9,5,5,2,2,2,5,1,1,4 +1115282,5,3,5,5,3,3,4,10,1,4 +1115293,1,1,1,1,2,2,2,1,1,2 +1116116,9,10,10,1,10,8,3,3,1,4 +1116132,6,3,4,1,5,2,3,9,1,4 +1116192,1,1,1,1,2,1,2,1,1,2 +1116998,10,4,2,1,3,2,4,3,10,4 +1117152,4,1,1,1,2,1,3,1,1,2 +1118039,5,3,4,1,8,10,4,9,1,4 +1120559,8,3,8,3,4,9,8,9,8,4 +1121732,1,1,1,1,2,1,3,2,1,2 +1121919,5,1,3,1,2,1,2,1,1,2 +1123061,6,10,2,8,10,2,7,8,10,4 +1124651,1,3,3,2,2,1,7,2,1,2 +1125035,9,4,5,10,6,10,4,8,1,4 +1126417,10,6,4,1,3,4,3,2,3,4 +1131294,1,1,2,1,2,2,4,2,1,2 +1132347,1,1,4,1,2,1,2,1,1,2 +1133041,5,3,1,2,2,1,2,1,1,2 +1133136,3,1,1,1,2,3,3,1,1,2 +1136142,2,1,1,1,3,1,2,1,1,2 +1137156,2,2,2,1,1,1,7,1,1,2 +1143978,4,1,1,2,2,1,2,1,1,2 +1143978,5,2,1,1,2,1,3,1,1,2 +1147044,3,1,1,1,2,2,7,1,1,2 +1147699,3,5,7,8,8,9,7,10,7,4 +1147748,5,10,6,1,10,4,4,10,10,4 +1148278,3,3,6,4,5,8,4,4,1,4 +1148873,3,6,6,6,5,10,6,8,3,4 +1152331,4,1,1,1,2,1,3,1,1,2 +1155546,2,1,1,2,3,1,2,1,1,2 +1156272,1,1,1,1,2,1,3,1,1,2 +1156948,3,1,1,2,2,1,1,1,1,2 +1157734,4,1,1,1,2,1,3,1,1,2 
+1158247,1,1,1,1,2,1,2,1,1,2 +1160476,2,1,1,1,2,1,3,1,1,2 +1164066,1,1,1,1,2,1,3,1,1,2 +1165297,2,1,1,2,2,1,1,1,1,2 +1165790,5,1,1,1,2,1,3,1,1,2 +1165926,9,6,9,2,10,6,2,9,10,4 +1166630,7,5,6,10,5,10,7,9,4,4 +1166654,10,3,5,1,10,5,3,10,2,4 +1167439,2,3,4,4,2,5,2,5,1,4 +1167471,4,1,2,1,2,1,3,1,1,2 +1168359,8,2,3,1,6,3,7,1,1,4 +1168736,10,10,10,10,10,1,8,8,8,4 +1169049,7,3,4,4,3,3,3,2,7,4 +1170419,10,10,10,8,2,10,4,1,1,4 +1170420,1,6,8,10,8,10,5,7,1,4 +1171710,1,1,1,1,2,1,2,3,1,2 +1171710,6,5,4,4,3,9,7,8,3,4 +1171795,1,3,1,2,2,2,5,3,2,2 +1171845,8,6,4,3,5,9,3,1,1,4 +1172152,10,3,3,10,2,10,7,3,3,4 +1173216,10,10,10,3,10,8,8,1,1,4 +1173235,3,3,2,1,2,3,3,1,1,2 +1173347,1,1,1,1,2,5,1,1,1,2 +1173347,8,3,3,1,2,2,3,2,1,2 +1173509,4,5,5,10,4,10,7,5,8,4 +1173514,1,1,1,1,4,3,1,1,1,2 +1173681,3,2,1,1,2,2,3,1,1,2 +1174057,1,1,2,2,2,1,3,1,1,2 +1174057,4,2,1,1,2,2,3,1,1,2 +1174131,10,10,10,2,10,10,5,3,3,4 +1174428,5,3,5,1,8,10,5,3,1,4 +1175937,5,4,6,7,9,7,8,10,1,4 +1176406,1,1,1,1,2,1,2,1,1,2 +1176881,7,5,3,7,4,10,7,5,5,4 +1177027,3,1,1,1,2,1,3,1,1,2 +1177399,8,3,5,4,5,10,1,6,2,4 +1177512,1,1,1,1,10,1,1,1,1,2 +1178580,5,1,3,1,2,1,2,1,1,2 +1179818,2,1,1,1,2,1,3,1,1,2 +1180194,5,10,8,10,8,10,3,6,3,4 +1180523,3,1,1,1,2,1,2,2,1,2 +1180831,3,1,1,1,3,1,2,1,1,2 +1181356,5,1,1,1,2,2,3,3,1,2 +1182404,4,1,1,1,2,1,2,1,1,2 +1182410,3,1,1,1,2,1,1,1,1,2 +1183240,4,1,2,1,2,1,2,1,1,2 +1183246,1,1,1,1,1,?,2,1,1,2 +1183516,3,1,1,1,2,1,1,1,1,2 +1183911,2,1,1,1,2,1,1,1,1,2 +1183983,9,5,5,4,4,5,4,3,3,4 +1184184,1,1,1,1,2,5,1,1,1,2 +1184241,2,1,1,1,2,1,2,1,1,2 +1184840,1,1,3,1,2,?,2,1,1,2 +1185609,3,4,5,2,6,8,4,1,1,4 +1185610,1,1,1,1,3,2,2,1,1,2 +1187457,3,1,1,3,8,1,5,8,1,2 +1187805,8,8,7,4,10,10,7,8,7,4 +1188472,1,1,1,1,1,1,3,1,1,2 +1189266,7,2,4,1,6,10,5,4,3,4 +1189286,10,10,8,6,4,5,8,10,1,4 +1190394,4,1,1,1,2,3,1,1,1,2 +1190485,1,1,1,1,2,1,1,1,1,2 +1192325,5,5,5,6,3,10,3,1,1,4 +1193091,1,2,2,1,2,1,2,1,1,2 +1193210,2,1,1,1,2,1,3,1,1,2 +1193683,1,1,2,1,3,?,1,1,1,2 +1196295,9,9,10,3,6,10,7,10,6,4 +1196915,10,7,7,4,5,10,5,7,2,4 +1197080,4,1,1,1,2,1,3,2,1,2 +1197270,3,1,1,1,2,1,3,1,1,2 +1197440,1,1,1,2,1,3,1,1,7,2 +1197510,5,1,1,1,2,?,3,1,1,2 +1197979,4,1,1,1,2,2,3,2,1,2 +1197993,5,6,7,8,8,10,3,10,3,4 +1198128,10,8,10,10,6,1,3,1,10,4 +1198641,3,1,1,1,2,1,3,1,1,2 +1199219,1,1,1,2,1,1,1,1,1,2 +1199731,3,1,1,1,2,1,1,1,1,2 +1199983,1,1,1,1,2,1,3,1,1,2 +1200772,1,1,1,1,2,1,2,1,1,2 +1200847,6,10,10,10,8,10,10,10,7,4 +1200892,8,6,5,4,3,10,6,1,1,4 +1200952,5,8,7,7,10,10,5,7,1,4 +1201834,2,1,1,1,2,1,3,1,1,2 +1201936,5,10,10,3,8,1,5,10,3,4 +1202125,4,1,1,1,2,1,3,1,1,2 +1202812,5,3,3,3,6,10,3,1,1,4 +1203096,1,1,1,1,1,1,3,1,1,2 +1204242,1,1,1,1,2,1,1,1,1,2 +1204898,6,1,1,1,2,1,3,1,1,2 +1205138,5,8,8,8,5,10,7,8,1,4 +1205579,8,7,6,4,4,10,5,1,1,4 +1206089,2,1,1,1,1,1,3,1,1,2 +1206695,1,5,8,6,5,8,7,10,1,4 +1206841,10,5,6,10,6,10,7,7,10,4 +1207986,5,8,4,10,5,8,9,10,1,4 +1208301,1,2,3,1,2,1,3,1,1,2 +1210963,10,10,10,8,6,8,7,10,1,4 +1211202,7,5,10,10,10,10,4,10,3,4 +1212232,5,1,1,1,2,1,2,1,1,2 +1212251,1,1,1,1,2,1,3,1,1,2 +1212422,3,1,1,1,2,1,3,1,1,2 +1212422,4,1,1,1,2,1,3,1,1,2 +1213375,8,4,4,5,4,7,7,8,2,2 +1213383,5,1,1,4,2,1,3,1,1,2 +1214092,1,1,1,1,2,1,1,1,1,2 +1214556,3,1,1,1,2,1,2,1,1,2 +1214966,9,7,7,5,5,10,7,8,3,4 +1216694,10,8,8,4,10,10,8,1,1,4 +1216947,1,1,1,1,2,1,3,1,1,2 +1217051,5,1,1,1,2,1,3,1,1,2 +1217264,1,1,1,1,2,1,3,1,1,2 +1218105,5,10,10,9,6,10,7,10,5,4 +1218741,10,10,9,3,7,5,3,5,1,4 +1218860,1,1,1,1,1,1,3,1,1,2 +1218860,1,1,1,1,1,1,3,1,1,2 +1219406,5,1,1,1,1,1,3,1,1,2 +1219525,8,10,10,10,5,10,8,10,6,4 
+1219859,8,10,8,8,4,8,7,7,1,4 +1220330,1,1,1,1,2,1,3,1,1,2 +1221863,10,10,10,10,7,10,7,10,4,4 +1222047,10,10,10,10,3,10,10,6,1,4 +1222936,8,7,8,7,5,5,5,10,2,4 +1223282,1,1,1,1,2,1,2,1,1,2 +1223426,1,1,1,1,2,1,3,1,1,2 +1223793,6,10,7,7,6,4,8,10,2,4 +1223967,6,1,3,1,2,1,3,1,1,2 +1224329,1,1,1,2,2,1,3,1,1,2 +1225799,10,6,4,3,10,10,9,10,1,4 +1226012,4,1,1,3,1,5,2,1,1,4 +1226612,7,5,6,3,3,8,7,4,1,4 +1227210,10,5,5,6,3,10,7,9,2,4 +1227244,1,1,1,1,2,1,2,1,1,2 +1227481,10,5,7,4,4,10,8,9,1,4 +1228152,8,9,9,5,3,5,7,7,1,4 +1228311,1,1,1,1,1,1,3,1,1,2 +1230175,10,10,10,3,10,10,9,10,1,4 +1230688,7,4,7,4,3,7,7,6,1,4 +1231387,6,8,7,5,6,8,8,9,2,4 +1231706,8,4,6,3,3,1,4,3,1,2 +1232225,10,4,5,5,5,10,4,1,1,4 +1236043,3,3,2,1,3,1,3,6,1,2 +1241232,3,1,4,1,2,?,3,1,1,2 +1241559,10,8,8,2,8,10,4,8,10,4 +1241679,9,8,8,5,6,2,4,10,4,4 +1242364,8,10,10,8,6,9,3,10,10,4 +1243256,10,4,3,2,3,10,5,3,2,4 +1270479,5,1,3,3,2,2,2,3,1,2 +1276091,3,1,1,3,1,1,3,1,1,2 +1277018,2,1,1,1,2,1,3,1,1,2 +128059,1,1,1,1,2,5,5,1,1,2 +1285531,1,1,1,1,2,1,3,1,1,2 +1287775,5,1,1,2,2,2,3,1,1,2 +144888,8,10,10,8,5,10,7,8,1,4 +145447,8,4,4,1,2,9,3,3,1,4 +167528,4,1,1,1,2,1,3,6,1,2 +169356,3,1,1,1,2,?,3,1,1,2 +183913,1,2,2,1,2,1,1,1,1,2 +191250,10,4,4,10,2,10,5,3,3,4 +1017023,6,3,3,5,3,10,3,5,3,2 +1100524,6,10,10,2,8,10,7,3,3,4 +1116116,9,10,10,1,10,8,3,3,1,4 +1168736,5,6,6,2,4,10,3,6,1,4 +1182404,3,1,1,1,2,1,1,1,1,2 +1182404,3,1,1,1,2,1,2,1,1,2 +1198641,3,1,1,1,2,1,3,1,1,2 +242970,5,7,7,1,5,8,3,4,1,2 +255644,10,5,8,10,3,10,5,1,3,4 +263538,5,10,10,6,10,10,10,6,5,4 +274137,8,8,9,4,5,10,7,8,1,4 +303213,10,4,4,10,6,10,5,5,1,4 +314428,7,9,4,10,10,3,5,3,3,4 +1182404,5,1,4,1,2,1,3,2,1,2 +1198641,10,10,6,3,3,10,4,3,2,4 +320675,3,3,5,2,3,10,7,1,1,4 +324427,10,8,8,2,3,4,8,7,8,4 +385103,1,1,1,1,2,1,3,1,1,2 +390840,8,4,7,1,3,10,3,9,2,4 +411453,5,1,1,1,2,1,3,1,1,2 +320675,3,3,5,2,3,10,7,1,1,4 +428903,7,2,4,1,3,4,3,3,1,4 +431495,3,1,1,1,2,1,3,2,1,2 +432809,3,1,3,1,2,?,2,1,1,2 +434518,3,1,1,1,2,1,2,1,1,2 +452264,1,1,1,1,2,1,2,1,1,2 +456282,1,1,1,1,2,1,3,1,1,2 +476903,10,5,7,3,3,7,3,3,8,4 +486283,3,1,1,1,2,1,3,1,1,2 +486662,2,1,1,2,2,1,3,1,1,2 +488173,1,4,3,10,4,10,5,6,1,4 +492268,10,4,6,1,2,10,5,3,1,4 +508234,7,4,5,10,2,10,3,8,2,4 +527363,8,10,10,10,8,10,10,7,3,4 +529329,10,10,10,10,10,10,4,10,10,4 +535331,3,1,1,1,3,1,2,1,1,2 +543558,6,1,3,1,4,5,5,10,1,4 +555977,5,6,6,8,6,10,4,10,4,4 +560680,1,1,1,1,2,1,1,1,1,2 +561477,1,1,1,1,2,1,3,1,1,2 +563649,8,8,8,1,2,?,6,10,1,4 +601265,10,4,4,6,2,10,2,3,1,4 +606140,1,1,1,1,2,?,2,1,1,2 +606722,5,5,7,8,6,10,7,4,1,4 +616240,5,3,4,3,4,5,4,7,1,2 +61634,5,4,3,1,2,?,2,3,1,2 +625201,8,2,1,1,5,1,1,1,1,2 +63375,9,1,2,6,4,10,7,7,2,4 +635844,8,4,10,5,4,4,7,10,1,4 +636130,1,1,1,1,2,1,3,1,1,2 +640744,10,10,10,7,9,10,7,10,10,4 +646904,1,1,1,1,2,1,3,1,1,2 +653777,8,3,4,9,3,10,3,3,1,4 +659642,10,8,4,4,4,10,3,10,4,4 +666090,1,1,1,1,2,1,3,1,1,2 +666942,1,1,1,1,2,1,3,1,1,2 +667204,7,8,7,6,4,3,8,8,4,4 +673637,3,1,1,1,2,5,5,1,1,2 +684955,2,1,1,1,3,1,2,1,1,2 +688033,1,1,1,1,2,1,1,1,1,2 +691628,8,6,4,10,10,1,3,5,1,4 +693702,1,1,1,1,2,1,1,1,1,2 +704097,1,1,1,1,1,1,2,1,1,2 +704168,4,6,5,6,7,?,4,9,1,2 +706426,5,5,5,2,5,10,4,3,1,4 +709287,6,8,7,8,6,8,8,9,1,4 +718641,1,1,1,1,5,1,3,1,1,2 +721482,4,4,4,4,6,5,7,3,1,2 +730881,7,6,3,2,5,10,7,4,6,4 +733639,3,1,1,1,2,?,3,1,1,2 +733639,3,1,1,1,2,1,3,1,1,2 +733823,5,4,6,10,2,10,4,1,1,4 +740492,1,1,1,1,2,1,3,1,1,2 +743348,3,2,2,1,2,1,2,3,1,2 +752904,10,1,1,1,2,10,5,4,1,4 +756136,1,1,1,1,2,1,2,1,1,2 +760001,8,10,3,2,6,4,3,10,1,4 +760239,10,4,6,4,5,10,7,1,1,4 +76389,10,4,7,2,2,8,6,1,1,4 
+764974,5,1,1,1,2,1,3,1,2,2 +770066,5,2,2,2,2,1,2,2,1,2 +785208,5,4,6,6,4,10,4,3,1,4 +785615,8,6,7,3,3,10,3,4,2,4 +792744,1,1,1,1,2,1,1,1,1,2 +797327,6,5,5,8,4,10,3,4,1,4 +798429,1,1,1,1,2,1,3,1,1,2 +704097,1,1,1,1,1,1,2,1,1,2 +806423,8,5,5,5,2,10,4,3,1,4 +809912,10,3,3,1,2,10,7,6,1,4 +810104,1,1,1,1,2,1,3,1,1,2 +814265,2,1,1,1,2,1,1,1,1,2 +814911,1,1,1,1,2,1,1,1,1,2 +822829,7,6,4,8,10,10,9,5,3,4 +826923,1,1,1,1,2,1,1,1,1,2 +830690,5,2,2,2,3,1,1,3,1,2 +831268,1,1,1,1,1,1,1,3,1,2 +832226,3,4,4,10,5,1,3,3,1,4 +832567,4,2,3,5,3,8,7,6,1,4 +836433,5,1,1,3,2,1,1,1,1,2 +837082,2,1,1,1,2,1,3,1,1,2 +846832,3,4,5,3,7,3,4,6,1,2 +850831,2,7,10,10,7,10,4,9,4,4 +855524,1,1,1,1,2,1,2,1,1,2 +857774,4,1,1,1,3,1,2,2,1,2 +859164,5,3,3,1,3,3,3,3,3,4 +859350,8,10,10,7,10,10,7,3,8,4 +866325,8,10,5,3,8,4,4,10,3,4 +873549,10,3,5,4,3,7,3,5,3,4 +877291,6,10,10,10,10,10,8,10,10,4 +877943,3,10,3,10,6,10,5,1,4,4 +888169,3,2,2,1,4,3,2,1,1,2 +888523,4,4,4,2,2,3,2,1,1,2 +896404,2,1,1,1,2,1,3,1,1,2 +897172,2,1,1,1,2,1,2,1,1,2 +95719,6,10,10,10,8,10,7,10,7,4 +160296,5,8,8,10,5,10,8,10,3,4 +342245,1,1,3,1,2,1,1,1,1,2 +428598,1,1,3,1,1,1,2,1,1,2 +492561,4,3,2,1,3,1,2,1,1,2 +493452,1,1,3,1,2,1,1,1,1,2 +493452,4,1,2,1,2,1,2,1,1,2 +521441,5,1,1,2,2,1,2,1,1,2 +560680,3,1,2,1,2,1,2,1,1,2 +636437,1,1,1,1,2,1,1,1,1,2 +640712,1,1,1,1,2,1,2,1,1,2 +654244,1,1,1,1,1,1,2,1,1,2 +657753,3,1,1,4,3,1,2,2,1,2 +685977,5,3,4,1,4,1,3,1,1,2 +805448,1,1,1,1,2,1,1,1,1,2 +846423,10,6,3,6,4,10,7,8,4,4 +1002504,3,2,2,2,2,1,3,2,1,2 +1022257,2,1,1,1,2,1,1,1,1,2 +1026122,2,1,1,1,2,1,1,1,1,2 +1071084,3,3,2,2,3,1,1,2,3,2 +1080233,7,6,6,3,2,10,7,1,1,4 +1114570,5,3,3,2,3,1,3,1,1,2 +1114570,2,1,1,1,2,1,2,2,1,2 +1116715,5,1,1,1,3,2,2,2,1,2 +1131411,1,1,1,2,2,1,2,1,1,2 +1151734,10,8,7,4,3,10,7,9,1,4 +1156017,3,1,1,1,2,1,2,1,1,2 +1158247,1,1,1,1,1,1,1,1,1,2 +1158405,1,2,3,1,2,1,2,1,1,2 +1168278,3,1,1,1,2,1,2,1,1,2 +1176187,3,1,1,1,2,1,3,1,1,2 +1196263,4,1,1,1,2,1,1,1,1,2 +1196475,3,2,1,1,2,1,2,2,1,2 +1206314,1,2,3,1,2,1,1,1,1,2 +1211265,3,10,8,7,6,9,9,3,8,4 +1213784,3,1,1,1,2,1,1,1,1,2 +1223003,5,3,3,1,2,1,2,1,1,2 +1223306,3,1,1,1,2,4,1,1,1,2 +1223543,1,2,1,3,2,1,1,2,1,2 +1229929,1,1,1,1,2,1,2,1,1,2 +1231853,4,2,2,1,2,1,2,1,1,2 +1234554,1,1,1,1,2,1,2,1,1,2 +1236837,2,3,2,2,2,2,3,1,1,2 +1237674,3,1,2,1,2,1,2,1,1,2 +1238021,1,1,1,1,2,1,2,1,1,2 +1238464,1,1,1,1,1,?,2,1,1,2 +1238633,10,10,10,6,8,4,8,5,1,4 +1238915,5,1,2,1,2,1,3,1,1,2 +1238948,8,5,6,2,3,10,6,6,1,4 +1239232,3,3,2,6,3,3,3,5,1,2 +1239347,8,7,8,5,10,10,7,2,1,4 +1239967,1,1,1,1,2,1,2,1,1,2 +1240337,5,2,2,2,2,2,3,2,2,2 +1253505,2,3,1,1,5,1,1,1,1,2 +1255384,3,2,2,3,2,3,3,1,1,2 +1257200,10,10,10,7,10,10,8,2,1,4 +1257648,4,3,3,1,2,1,3,3,1,2 +1257815,5,1,3,1,2,1,2,1,1,2 +1257938,3,1,1,1,2,1,1,1,1,2 +1258549,9,10,10,10,10,10,10,10,1,4 +1258556,5,3,6,1,2,1,1,1,1,2 +1266154,8,7,8,2,4,2,5,10,1,4 +1272039,1,1,1,1,2,1,2,1,1,2 +1276091,2,1,1,1,2,1,2,1,1,2 +1276091,1,3,1,1,2,1,2,2,1,2 +1276091,5,1,1,3,4,1,3,2,1,2 +1277629,5,1,1,1,2,1,2,2,1,2 +1293439,3,2,2,3,2,1,1,1,1,2 +1293439,6,9,7,5,5,8,4,2,1,2 +1294562,10,8,10,1,3,10,5,1,1,4 +1295186,10,10,10,1,6,1,2,8,1,4 +527337,4,1,1,1,2,1,1,1,1,2 +558538,4,1,3,3,2,1,1,1,1,2 +566509,5,1,1,1,2,1,1,1,1,2 +608157,10,4,3,10,4,10,10,1,1,4 +677910,5,2,2,4,2,4,1,1,1,2 +734111,1,1,1,3,2,3,1,1,1,2 +734111,1,1,1,1,2,2,1,1,1,2 +780555,5,1,1,6,3,1,2,1,1,2 +827627,2,1,1,1,2,1,1,1,1,2 +1049837,1,1,1,1,2,1,1,1,1,2 +1058849,5,1,1,1,2,1,1,1,1,2 +1182404,1,1,1,1,1,1,1,1,1,2 +1193544,5,7,9,8,6,10,8,10,1,4 +1201870,4,1,1,3,1,1,2,1,1,2 +1202253,5,1,1,1,2,1,1,1,1,2 +1227081,3,1,1,3,2,1,1,1,1,2 
+1230994,4,5,5,8,6,10,10,7,1,4 +1238410,2,3,1,1,3,1,1,1,1,2 +1246562,10,2,2,1,2,6,1,1,2,4 +1257470,10,6,5,8,5,10,8,6,1,4 +1259008,8,8,9,6,6,3,10,10,1,4 +1266124,5,1,2,1,2,1,1,1,1,2 +1267898,5,1,3,1,2,1,1,1,1,2 +1268313,5,1,1,3,2,1,1,1,1,2 +1268804,3,1,1,1,2,5,1,1,1,2 +1276091,6,1,1,3,2,1,1,1,1,2 +1280258,4,1,1,1,2,1,1,2,1,2 +1293966,4,1,1,1,2,1,1,1,1,2 +1296572,10,9,8,7,6,4,7,10,3,4 +1298416,10,6,6,2,4,10,9,7,1,4 +1299596,6,6,6,5,4,10,7,6,2,4 +1105524,4,1,1,1,2,1,1,1,1,2 +1181685,1,1,2,1,2,1,2,1,1,2 +1211594,3,1,1,1,1,1,2,1,1,2 +1238777,6,1,1,3,2,1,1,1,1,2 +1257608,6,1,1,1,1,1,1,1,1,2 +1269574,4,1,1,1,2,1,1,1,1,2 +1277145,5,1,1,1,2,1,1,1,1,2 +1287282,3,1,1,1,2,1,1,1,1,2 +1296025,4,1,2,1,2,1,1,1,1,2 +1296263,4,1,1,1,2,1,1,1,1,2 +1296593,5,2,1,1,2,1,1,1,1,2 +1299161,4,8,7,10,4,10,7,5,1,4 +1301945,5,1,1,1,1,1,1,1,1,2 +1302428,5,3,2,4,2,1,1,1,1,2 +1318169,9,10,10,10,10,5,10,10,10,4 +474162,8,7,8,5,5,10,9,10,1,4 +787451,5,1,2,1,2,1,1,1,1,2 +1002025,1,1,1,3,1,3,1,1,1,2 +1070522,3,1,1,1,1,1,2,1,1,2 +1073960,10,10,10,10,6,10,8,1,5,4 +1076352,3,6,4,10,3,3,3,4,1,4 +1084139,6,3,2,1,3,4,4,1,1,4 +1115293,1,1,1,1,2,1,1,1,1,2 +1119189,5,8,9,4,3,10,7,1,1,4 +1133991,4,1,1,1,1,1,2,1,1,2 +1142706,5,10,10,10,6,10,6,5,2,4 +1155967,5,1,2,10,4,5,2,1,1,2 +1170945,3,1,1,1,1,1,2,1,1,2 +1181567,1,1,1,1,1,1,1,1,1,2 +1182404,4,2,1,1,2,1,1,1,1,2 +1204558,4,1,1,1,2,1,2,1,1,2 +1217952,4,1,1,1,2,1,2,1,1,2 +1224565,6,1,1,1,2,1,3,1,1,2 +1238186,4,1,1,1,2,1,2,1,1,2 +1253917,4,1,1,2,2,1,2,1,1,2 +1265899,4,1,1,1,2,1,3,1,1,2 +1268766,1,1,1,1,2,1,1,1,1,2 +1277268,3,3,1,1,2,1,1,1,1,2 +1286943,8,10,10,10,7,5,4,8,7,4 +1295508,1,1,1,1,2,4,1,1,1,2 +1297327,5,1,1,1,2,1,1,1,1,2 +1297522,2,1,1,1,2,1,1,1,1,2 +1298360,1,1,1,1,2,1,1,1,1,2 +1299924,5,1,1,1,2,1,2,1,1,2 +1299994,5,1,1,1,2,1,1,1,1,2 +1304595,3,1,1,1,1,1,2,1,1,2 +1306282,6,6,7,10,3,10,8,10,2,4 +1313325,4,10,4,7,3,10,9,10,1,4 +1320077,1,1,1,1,1,1,1,1,1,2 +1320077,1,1,1,1,1,1,2,1,1,2 +1320304,3,1,2,2,2,1,1,1,1,2 +1330439,4,7,8,3,4,10,9,1,1,4 +333093,1,1,1,1,3,1,1,1,1,2 +369565,4,1,1,1,3,1,1,1,1,2 +412300,10,4,5,4,3,5,7,3,1,4 +672113,7,5,6,10,4,10,5,3,1,4 +749653,3,1,1,1,2,1,2,1,1,2 +769612,3,1,1,2,2,1,1,1,1,2 +769612,4,1,1,1,2,1,1,1,1,2 +798429,4,1,1,1,2,1,3,1,1,2 +807657,6,1,3,2,2,1,1,1,1,2 +8233704,4,1,1,1,1,1,2,1,1,2 +837480,7,4,4,3,4,10,6,9,1,4 +867392,4,2,2,1,2,1,2,1,1,2 +869828,1,1,1,1,1,1,3,1,1,2 +1043068,3,1,1,1,2,1,2,1,1,2 +1056171,2,1,1,1,2,1,2,1,1,2 +1061990,1,1,3,2,2,1,3,1,1,2 +1113061,5,1,1,1,2,1,3,1,1,2 +1116192,5,1,2,1,2,1,3,1,1,2 +1135090,4,1,1,1,2,1,2,1,1,2 +1145420,6,1,1,1,2,1,2,1,1,2 +1158157,5,1,1,1,2,2,2,1,1,2 +1171578,3,1,1,1,2,1,1,1,1,2 +1174841,5,3,1,1,2,1,1,1,1,2 +1184586,4,1,1,1,2,1,2,1,1,2 +1186936,2,1,3,2,2,1,2,1,1,2 +1197527,5,1,1,1,2,1,2,1,1,2 +1222464,6,10,10,10,4,10,7,10,1,4 +1240603,2,1,1,1,1,1,1,1,1,2 +1240603,3,1,1,1,1,1,1,1,1,2 +1241035,7,8,3,7,4,5,7,8,2,4 +1287971,3,1,1,1,2,1,2,1,1,2 +1289391,1,1,1,1,2,1,3,1,1,2 +1299924,3,2,2,2,2,1,4,2,1,2 +1306339,4,4,2,1,2,5,2,1,2,2 +1313658,3,1,1,1,2,1,1,1,1,2 +1313982,4,3,1,1,2,1,4,8,1,2 +1321264,5,2,2,2,1,1,2,1,1,2 +1321321,5,1,1,3,2,1,1,1,1,2 +1321348,2,1,1,1,2,1,2,1,1,2 +1321931,5,1,1,1,2,1,2,1,1,2 +1321942,5,1,1,1,2,1,3,1,1,2 +1321942,5,1,1,1,2,1,3,1,1,2 +1328331,1,1,1,1,2,1,3,1,1,2 +1328755,3,1,1,1,2,1,2,1,1,2 +1331405,4,1,1,1,2,1,3,2,1,2 +1331412,5,7,10,10,5,10,10,10,1,4 +1333104,3,1,2,1,2,1,3,1,1,2 +1334071,4,1,1,1,2,3,2,1,1,2 +1343068,8,4,4,1,6,10,2,5,2,4 +1343374,10,10,8,10,6,5,10,3,1,4 +1344121,8,10,4,4,8,10,8,2,1,4 +142932,7,6,10,5,3,10,9,10,2,4 +183936,3,1,1,1,2,1,2,1,1,2 
+324382,1,1,1,1,2,1,2,1,1,2 +378275,10,9,7,3,4,2,7,7,1,4 +385103,5,1,2,1,2,1,3,1,1,2 +690557,5,1,1,1,2,1,2,1,1,2 +695091,1,1,1,1,2,1,2,1,1,2 +695219,1,1,1,1,2,1,2,1,1,2 +824249,1,1,1,1,2,1,3,1,1,2 +871549,5,1,2,1,2,1,2,1,1,2 +878358,5,7,10,6,5,10,7,5,1,4 +1107684,6,10,5,5,4,10,6,10,1,4 +1115762,3,1,1,1,2,1,1,1,1,2 +1217717,5,1,1,6,3,1,1,1,1,2 +1239420,1,1,1,1,2,1,1,1,1,2 +1254538,8,10,10,10,6,10,10,10,1,4 +1261751,5,1,1,1,2,1,2,2,1,2 +1268275,9,8,8,9,6,3,4,1,1,4 +1272166,5,1,1,1,2,1,1,1,1,2 +1294261,4,10,8,5,4,1,10,1,1,4 +1295529,2,5,7,6,4,10,7,6,1,4 +1298484,10,3,4,5,3,10,4,1,1,4 +1311875,5,1,2,1,2,1,1,1,1,2 +1315506,4,8,6,3,4,10,7,1,1,4 +1320141,5,1,1,1,2,1,2,1,1,2 +1325309,4,1,2,1,2,1,2,1,1,2 +1333063,5,1,3,1,2,1,3,1,1,2 +1333495,3,1,1,1,2,1,2,1,1,2 +1334659,5,2,4,1,1,1,1,1,1,2 +1336798,3,1,1,1,2,1,2,1,1,2 +1344449,1,1,1,1,1,1,2,1,1,2 +1350568,4,1,1,1,2,1,2,1,1,2 +1352663,5,4,6,8,4,1,8,10,1,4 +188336,5,3,2,8,5,10,8,1,2,4 +352431,10,5,10,3,5,8,7,8,3,4 +353098,4,1,1,2,2,1,1,1,1,2 +411453,1,1,1,1,2,1,1,1,1,2 +557583,5,10,10,10,10,10,10,1,1,4 +636375,5,1,1,1,2,1,1,1,1,2 +736150,10,4,3,10,3,10,7,1,2,4 +803531,5,10,10,10,5,2,8,5,1,4 +822829,8,10,10,10,6,10,10,10,10,4 +1016634,2,3,1,1,2,1,2,1,1,2 +1031608,2,1,1,1,1,1,2,1,1,2 +1041043,4,1,3,1,2,1,2,1,1,2 +1042252,3,1,1,1,2,1,2,1,1,2 +1057067,1,1,1,1,1,?,1,1,1,2 +1061990,4,1,1,1,2,1,2,1,1,2 +1073836,5,1,1,1,2,1,2,1,1,2 +1083817,3,1,1,1,2,1,2,1,1,2 +1096352,6,3,3,3,3,2,6,1,1,2 +1140597,7,1,2,3,2,1,2,1,1,2 +1149548,1,1,1,1,2,1,1,1,1,2 +1174009,5,1,1,2,1,1,2,1,1,2 +1183596,3,1,3,1,3,4,1,1,1,2 +1190386,4,6,6,5,7,6,7,7,3,4 +1190546,2,1,1,1,2,5,1,1,1,2 +1213273,2,1,1,1,2,1,1,1,1,2 +1218982,4,1,1,1,2,1,1,1,1,2 +1225382,6,2,3,1,2,1,1,1,1,2 +1235807,5,1,1,1,2,1,2,1,1,2 +1238777,1,1,1,1,2,1,1,1,1,2 +1253955,8,7,4,4,5,3,5,10,1,4 +1257366,3,1,1,1,2,1,1,1,1,2 +1260659,3,1,4,1,2,1,1,1,1,2 +1268952,10,10,7,8,7,1,10,10,3,4 +1275807,4,2,4,3,2,2,2,1,1,2 +1277792,4,1,1,1,2,1,1,1,1,2 +1277792,5,1,1,3,2,1,1,1,1,2 +1285722,4,1,1,3,2,1,1,1,1,2 +1288608,3,1,1,1,2,1,2,1,1,2 +1290203,3,1,1,1,2,1,2,1,1,2 +1294413,1,1,1,1,2,1,1,1,1,2 +1299596,2,1,1,1,2,1,1,1,1,2 +1303489,3,1,1,1,2,1,2,1,1,2 +1311033,1,2,2,1,2,1,1,1,1,2 +1311108,1,1,1,3,2,1,1,1,1,2 +1315807,5,10,10,10,10,2,10,10,10,4 +1318671,3,1,1,1,2,1,2,1,1,2 +1319609,3,1,1,2,3,4,1,1,1,2 +1323477,1,2,1,3,2,1,2,1,1,2 +1324572,5,1,1,1,2,1,2,2,1,2 +1324681,4,1,1,1,2,1,2,1,1,2 +1325159,3,1,1,1,2,1,3,1,1,2 +1326892,3,1,1,1,2,1,2,1,1,2 +1330361,5,1,1,1,2,1,2,1,1,2 +1333877,5,4,5,1,8,1,3,6,1,2 +1334015,7,8,8,7,3,10,7,2,3,4 +1334667,1,1,1,1,2,1,1,1,1,2 +1339781,1,1,1,1,2,1,2,1,1,2 +1339781,4,1,1,1,2,1,3,1,1,2 +13454352,1,1,3,1,2,1,2,1,1,2 +1345452,1,1,3,1,2,1,2,1,1,2 +1345593,3,1,1,3,2,1,2,1,1,2 +1347749,1,1,1,1,2,1,1,1,1,2 +1347943,5,2,2,2,2,1,1,1,2,2 +1348851,3,1,1,1,2,1,3,1,1,2 +1350319,5,7,4,1,6,1,7,10,3,4 +1350423,5,10,10,8,5,5,7,10,1,4 +1352848,3,10,7,8,5,8,7,4,1,4 +1353092,3,2,1,2,2,1,3,1,1,2 +1354840,2,1,1,1,2,1,3,1,1,2 +1354840,5,3,2,1,3,1,1,1,1,2 +1355260,1,1,1,1,2,1,2,1,1,2 +1365075,4,1,4,1,2,1,1,1,1,2 +1365328,1,1,2,1,2,1,2,1,1,2 +1368267,5,1,1,1,2,1,1,1,1,2 +1368273,1,1,1,1,2,1,1,1,1,2 +1368882,2,1,1,1,2,1,1,1,1,2 +1369821,10,10,10,10,5,10,10,10,7,4 +1371026,5,10,10,10,4,10,5,6,3,4 +1371920,5,1,1,1,2,1,3,2,1,2 +466906,1,1,1,1,2,1,1,1,1,2 +466906,1,1,1,1,2,1,1,1,1,2 +534555,1,1,1,1,2,1,1,1,1,2 +536708,1,1,1,1,2,1,1,1,1,2 +566346,3,1,1,1,2,1,2,3,1,2 +603148,4,1,1,1,2,1,1,1,1,2 +654546,1,1,1,1,2,1,1,1,8,2 +654546,1,1,1,3,2,1,1,1,1,2 +695091,5,10,10,5,4,5,4,4,1,4 +714039,3,1,1,1,2,1,1,1,1,2 +763235,3,1,1,1,2,1,2,1,2,2 
+776715,3,1,1,1,3,2,1,1,1,2 +841769,2,1,1,1,2,1,1,1,1,2 +888820,5,10,10,3,7,3,8,10,2,4 +897471,4,8,6,4,3,4,10,6,1,4 +897471,4,8,8,5,4,5,10,4,1,4 diff --git a/one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.names b/one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.names new file mode 100644 index 0000000..54b59a1 --- /dev/null +++ b/one_exercise_per_file/week02/day02/ex05/data/breast-cancer-wisconsin.names @@ -0,0 +1,126 @@ +Citation Request: + This breast cancer databases was obtained from the University of Wisconsin + Hospitals, Madison from Dr. William H. Wolberg. If you publish results + when using this database, then please include this information in your + acknowledgements. Also, please cite one or more of: + + 1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear + programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18. + + 2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of + pattern separation for medical diagnosis applied to breast cytology", + Proceedings of the National Academy of Sciences, U.S.A., Volume 87, + December 1990, pp 9193-9196. + + 3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition + via linear programming: Theory and application to medical diagnosis", + in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying + Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30. + + 4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming + discrimination of two linearly inseparable sets", Optimization Methods + and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers). + +1. Title: Wisconsin Breast Cancer Database (January 8, 1991) + +2. Sources: + -- Dr. WIlliam H. Wolberg (physician) + University of Wisconsin Hospitals + Madison, Wisconsin + USA + -- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu) + Received by David W. Aha (aha@cs.jhu.edu) + -- Date: 15 July 1992 + +3. Past Usage: + + Attributes 2 through 10 have been used to represent instances. + Each instance has one of 2 possible classes: benign or malignant. + + 1. Wolberg,~W.~H., \& Mangasarian,~O.~L. (1990). Multisurface method of + pattern separation for medical diagnosis applied to breast cytology. In + {\it Proceedings of the National Academy of Sciences}, {\it 87}, + 9193--9196. + -- Size of data set: only 369 instances (at that point in time) + -- Collected classification results: 1 trial only + -- Two pairs of parallel hyperplanes were found to be consistent with + 50% of the data + -- Accuracy on remaining 50% of dataset: 93.5% + -- Three pairs of parallel hyperplanes were found to be consistent with + 67% of data + -- Accuracy on remaining 33% of dataset: 95.9% + + 2. Zhang,~J. (1992). Selecting typical instances in instance-based + learning. In {\it Proceedings of the Ninth International Machine + Learning Conference} (pp. 470--479). Aberdeen, Scotland: Morgan + Kaufmann. + -- Size of data set: only 369 instances (at that point in time) + -- Applied 4 instance-based learning algorithms + -- Collected classification results averaged over 10 trials + -- Best accuracy result: + -- 1-nearest neighbor: 93.7% + -- trained on 200 instances, tested on the other 169 + -- Also of interest: + -- Using only typical instances: 92.2% (storing only 23.1 instances) + -- trained on 200 instances, tested on the other 169 + +4. Relevant Information: + + Samples arrive periodically as Dr. Wolberg reports his clinical cases. 
+ The database therefore reflects this chronological grouping of the data. + This grouping information appears immediately below, having been removed + from the data itself: + + Group 1: 367 instances (January 1989) + Group 2: 70 instances (October 1989) + Group 3: 31 instances (February 1990) + Group 4: 17 instances (April 1990) + Group 5: 48 instances (August 1990) + Group 6: 49 instances (Updated January 1991) + Group 7: 31 instances (June 1991) + Group 8: 86 instances (November 1991) + ----------------------------------------- + Total: 699 points (as of the donated datbase on 15 July 1992) + + Note that the results summarized above in Past Usage refer to a dataset + of size 369, while Group 1 has only 367 instances. This is because it + originally contained 369 instances; 2 were removed. The following + statements summarizes changes to the original Group 1's set of data: + + ##### Group 1 : 367 points: 200B 167M (January 1989) + ##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805 + ##### Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record + ##### : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial + ##### : Changed 0 to 1 in field 6 of sample 1219406 + ##### : Changed 0 to 1 in field 8 of following sample: + ##### : 1182404,2,3,1,1,1,2,0,1,1,1 + +5. Number of Instances: 699 (as of 15 July 1992) + +6. Number of Attributes: 10 plus the class attribute + +7. Attribute Information: (class attribute has been moved to last column) + + # Attribute Domain + -- ----------------------------------------- + 1. Sample code number id number + 2. Clump Thickness 1 - 10 + 3. Uniformity of Cell Size 1 - 10 + 4. Uniformity of Cell Shape 1 - 10 + 5. Marginal Adhesion 1 - 10 + 6. Single Epithelial Cell Size 1 - 10 + 7. Bare Nuclei 1 - 10 + 8. Bland Chromatin 1 - 10 + 9. Normal Nucleoli 1 - 10 + 10. Mitoses 1 - 10 + 11. Class: (2 for benign, 4 for malignant) + +8. Missing attribute values: 16 + + There are 16 instances in Groups 1 to 6 that contain a single missing + (i.e., unavailable) attribute value, now denoted by "?". + +9. 
Class distribution: + + Benign: 458 (65.5%) + Malignant: 241 (34.5%) diff --git a/one_exercise_per_file/week02/day03/ex05/data/breast-cancer.csv b/one_exercise_per_file/week02/day03/ex05/data/breast-cancer.csv new file mode 100644 index 0000000..0efb95a --- /dev/null +++ b/one_exercise_per_file/week02/day03/ex05/data/breast-cancer.csv @@ -0,0 +1,286 @@ +"40-49","premeno","15-19","0-2","yes","3","right","left_up","no","recurrence-events" +"50-59","ge40","15-19","0-2","no","1","right","central","no","no-recurrence-events" +"50-59","ge40","35-39","0-2","no","2","left","left_low","no","recurrence-events" +"40-49","premeno","35-39","0-2","yes","3","right","left_low","yes","no-recurrence-events" +"40-49","premeno","30-34","3-5","yes","2","left","right_up","no","recurrence-events" +"50-59","premeno","25-29","3-5","no","2","right","left_up","yes","no-recurrence-events" +"50-59","ge40","40-44","0-2","no","3","left","left_up","no","no-recurrence-events" +"40-49","premeno","10-14","0-2","no","2","left","left_up","no","no-recurrence-events" +"40-49","premeno","0-4","0-2","no","2","right","right_low","no","no-recurrence-events" +"40-49","ge40","40-44","15-17","yes","2","right","left_up","yes","no-recurrence-events" +"50-59","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events" +"60-69","ge40","15-19","0-2","no","2","right","left_up","no","no-recurrence-events" +"50-59","ge40","30-34","0-2","no","1","right","central","no","no-recurrence-events" +"50-59","ge40","25-29","0-2","no","2","right","left_up","no","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","2","left","left_low","yes","recurrence-events" +"30-39","premeno","20-24","0-2","no","3","left","central","no","no-recurrence-events" +"50-59","premeno","10-14","3-5","no","1","right","left_up","no","no-recurrence-events" +"60-69","ge40","15-19","0-2","no","2","right","left_up","no","no-recurrence-events" +"50-59","premeno","40-44","0-2","no","2","left","left_up","no","no-recurrence-events" +"50-59","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events" +"50-59","lt40","20-24","0-2",nan,"1","left","left_low","no","recurrence-events" +"60-69","ge40","40-44","3-5","no","2","right","left_up","yes","no-recurrence-events" +"50-59","ge40","15-19","0-2","no","2","right","left_low","no","no-recurrence-events" +"40-49","premeno","10-14","0-2","no","1","right","left_up","no","no-recurrence-events" +"30-39","premeno","15-19","6-8","yes","3","left","left_low","yes","recurrence-events" +"50-59","ge40","20-24","3-5","yes","2","right","left_up","no","no-recurrence-events" +"50-59","ge40","10-14","0-2","no","2","right","left_low","no","no-recurrence-events" +"40-49","premeno","10-14","0-2","no","1","right","left_up","no","no-recurrence-events" +"60-69","ge40","30-34","3-5","yes","3","left","left_low","no","no-recurrence-events" +"40-49","premeno","15-19","15-17","yes","3","left","left_low","no","recurrence-events" +"60-69","ge40","30-34","0-2","no","3","right","central","no","recurrence-events" +"60-69","ge40","25-29","3-5",nan,"1","right","left_low","yes","no-recurrence-events" +"50-59","ge40","25-29","0-2","no","3","left","right_up","no","no-recurrence-events" +"50-59","ge40","20-24","0-2","no","3","right","left_up","no","no-recurrence-events" +"40-49","premeno","30-34","0-2","no","1","left","left_low","yes","recurrence-events" +"30-39","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events" +"40-49","premeno","10-14","0-2","no","2","right","left_up","no","no-recurrence-events" 
+"60-69","ge40","45-49","6-8","yes","3","left","central","no","no-recurrence-events" +"40-49","ge40","20-24","0-2","no","3","left","left_low","no","no-recurrence-events" +"40-49","premeno","10-14","0-2","no","1","right","right_low","no","no-recurrence-events" +"30-39","premeno","35-39","0-2","no","3","left","left_low","no","recurrence-events" +"40-49","premeno","35-39","9-11","yes","2","right","right_up","yes","no-recurrence-events" +"60-69","ge40","25-29","0-2","no","2","right","left_low","no","no-recurrence-events" +"50-59","ge40","20-24","3-5","yes","3","right","right_up","no","recurrence-events" +"30-39","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events" +"50-59","premeno","30-34","0-2","no","3","left","right_up","no","recurrence-events" +"60-69","ge40","10-14","0-2","no","2","right","left_up","yes","no-recurrence-events" +"40-49","premeno","35-39","0-2","yes","3","right","left_up","yes","no-recurrence-events" +"50-59","premeno","50-54","0-2","yes","2","right","left_up","yes","no-recurrence-events" +"50-59","ge40","40-44","0-2","no","3","right","left_up","no","no-recurrence-events" +"70-79","ge40","15-19","9-11",nan,"1","left","left_low","yes","recurrence-events" +"50-59","lt40","30-34","0-2","no","3","right","left_up","no","no-recurrence-events" +"40-49","premeno","0-4","0-2","no","3","left","central","no","no-recurrence-events" +"70-79","ge40","40-44","0-2","no","1","right","right_up","no","no-recurrence-events" +"40-49","premeno","25-29","0-2",nan,"2","left","right_low","yes","no-recurrence-events" +"50-59","ge40","25-29","15-17","yes","3","right","left_up","no","no-recurrence-events" +"50-59","premeno","20-24","0-2","no","1","left","left_low","no","no-recurrence-events" +"50-59","ge40","35-39","15-17","no","3","left","left_low","no","no-recurrence-events" +"50-59","ge40","50-54","0-2","no","1","right","right_up","no","no-recurrence-events" +"30-39","premeno","0-4","0-2","no","2","right","central","no","recurrence-events" +"50-59","ge40","40-44","6-8","yes","3","left","left_low","yes","recurrence-events" +"40-49","premeno","30-34","0-2","no","2","right","right_up","yes","no-recurrence-events" +"40-49","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events" +"40-49","premeno","30-34","15-17","yes","3","left","left_low","no","recurrence-events" +"40-49","ge40","20-24","0-2","no","2","right","left_up","no","recurrence-events" +"50-59","ge40","15-19","0-2","no","1","right","central","no","no-recurrence-events" +"30-39","premeno","25-29","0-2","no","2","right","left_low","no","no-recurrence-events" +"60-69","ge40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events" +"50-59","premeno","50-54","9-11","yes","2","right","left_up","no","recurrence-events" +"30-39","premeno","10-14","0-2","no","1","right","left_low","no","no-recurrence-events" +"50-59","premeno","25-29","3-5","yes","3","left","left_low","yes","recurrence-events" +"60-69","ge40","25-29","3-5",nan,"1","right","left_up","yes","no-recurrence-events" +"60-69","ge40","10-14","0-2","no","1","right","left_low","no","no-recurrence-events" +"50-59","ge40","30-34","6-8","yes","3","left","right_low","no","recurrence-events" +"30-39","premeno","25-29","6-8","yes","3","left","right_low","yes","recurrence-events" +"50-59","ge40","10-14","0-2","no","1","left","left_low","no","no-recurrence-events" +"50-59","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","2","right","central","no","no-recurrence-events" 
+"40-49","premeno","25-29","0-2","no","3","left","right_up","no","recurrence-events" +"60-69","ge40","30-34","6-8","yes","2","right","right_up","no","no-recurrence-events" +"50-59","lt40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","2","right","left_low","no","no-recurrence-events" +"40-49","premeno","30-34","0-2","no","1","right","left_up","no","no-recurrence-events" +"60-69","ge40","15-19","0-2","no","2","left","left_up","yes","no-recurrence-events" +"30-39","premeno","0-4","0-2","no","2","right","central","no","no-recurrence-events" +"50-59","ge40","35-39","0-2","no","3","left","left_up","no","no-recurrence-events" +"40-49","premeno","40-44","0-2","no","1","right","left_up","no","no-recurrence-events" +"30-39","premeno","25-29","6-8","yes","2","right","left_up","yes","no-recurrence-events" +"50-59","ge40","20-24","0-2","no","1","right","left_low","no","no-recurrence-events" +"50-59","ge40","30-34","0-2","no","1","left","left_up","no","no-recurrence-events" +"60-69","ge40","20-24","0-2","no","1","right","left_up","no","recurrence-events" +"30-39","premeno","30-34","3-5","no","3","right","left_up","yes","recurrence-events" +"50-59","lt40","20-24","0-2",nan,"1","left","left_up","no","recurrence-events" +"50-59","premeno","10-14","0-2","no","2","right","left_up","no","no-recurrence-events" +"50-59","ge40","20-24","0-2","no","2","right","left_up","no","no-recurrence-events" +"40-49","premeno","45-49","0-2","no","2","left","left_low","yes","no-recurrence-events" +"30-39","premeno","40-44","0-2","no","1","left","left_up","no","recurrence-events" +"50-59","premeno","10-14","0-2","no","1","left","left_low","no","no-recurrence-events" +"60-69","ge40","30-34","0-2","no","3","right","left_up","yes","recurrence-events" +"40-49","premeno","35-39","0-2","no","1","right","left_up","no","recurrence-events" +"40-49","premeno","20-24","3-5","yes","2","left","left_low","yes","recurrence-events" +"50-59","premeno","15-19","0-2","no","2","left","left_low","no","recurrence-events" +"50-59","ge40","30-34","0-2","no","3","right","left_low","no","no-recurrence-events" +"60-69","ge40","20-24","0-2","no","2","left","left_up","no","no-recurrence-events" +"40-49","premeno","20-24","0-2","no","1","left","right_low","no","no-recurrence-events" +"60-69","ge40","30-34","3-5","yes","2","left","central","yes","recurrence-events" +"60-69","ge40","20-24","3-5","no","2","left","left_low","yes","recurrence-events" +"50-59","premeno","25-29","0-2","no","2","left","right_up","no","recurrence-events" +"50-59","ge40","30-34","0-2","no","1","right","right_up","no","no-recurrence-events" +"40-49","premeno","20-24","0-2","no","2","left","right_low","no","no-recurrence-events" +"60-69","ge40","15-19","0-2","no","1","right","left_up","no","no-recurrence-events" +"60-69","ge40","30-34","0-2","no","2","left","left_low","yes","no-recurrence-events" +"30-39","premeno","30-34","0-2","no","2","left","left_up","no","no-recurrence-events" +"30-39","premeno","40-44","3-5","no","3","right","right_up","yes","no-recurrence-events" +"60-69","ge40","5-9","0-2","no","1","left","central","no","no-recurrence-events" +"60-69","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events" +"40-49","premeno","30-34","6-8","yes","3","right","left_up","no","recurrence-events" +"60-69","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events" +"40-49","premeno","35-39","9-11","yes","2","right","left_up","yes","no-recurrence-events" 
+"40-49","premeno","20-24","0-2","no","1","right","left_low","no","no-recurrence-events" +"40-49","premeno","30-34","0-2","yes","3","right","right_up","no","recurrence-events" +"50-59","premeno","25-29","0-2","yes","2","left","left_up","no","no-recurrence-events" +"40-49","premeno","15-19","0-2","no","2","left","left_low","no","no-recurrence-events" +"30-39","premeno","35-39","9-11","yes","3","left","left_low","no","recurrence-events" +"30-39","premeno","10-14","0-2","no","2","left","right_low","no","no-recurrence-events" +"50-59","ge40","30-34","0-2","no","1","right","left_low","no","no-recurrence-events" +"60-69","ge40","30-34","0-2","no","2","left","left_up","no","no-recurrence-events" +"60-69","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events" +"40-49","premeno","15-19","0-2","no","2","left","left_up","no","recurrence-events" +"60-69","ge40","15-19","0-2","no","2","right","left_low","no","no-recurrence-events" +"40-49","premeno","30-34","0-2","no","2","left","right_low","no","no-recurrence-events" +"20-29","premeno","35-39","0-2","no","2","right","right_up","no","no-recurrence-events" +"40-49","premeno","30-34","0-2","no","3","right","right_up","no","recurrence-events" +"40-49","premeno","25-29","0-2","no","2","right","left_low","no","recurrence-events" +"30-39","premeno","30-34","0-2","no","3","left","left_low","no","no-recurrence-events" +"30-39","premeno","15-19","0-2","no","1","right","left_low","no","recurrence-events" +"50-59","ge40","0-4","0-2","no","1","right","central","no","no-recurrence-events" +"50-59","ge40","0-4","0-2","no","1","left","left_low","no","no-recurrence-events" +"60-69","ge40","50-54","0-2","no","3","right","left_up","no","recurrence-events" +"50-59","premeno","30-34","0-2","no","1","left","central","no","no-recurrence-events" +"60-69","ge40","20-24","15-17","yes","3","left","left_low","yes","recurrence-events" +"40-49","premeno","25-29","0-2","no","2","left","left_up","no","no-recurrence-events" +"40-49","premeno","30-34","3-5","no","2","right","left_up","no","recurrence-events" +"50-59","premeno","20-24","3-5","yes","2","left","left_low","no","no-recurrence-events" +"50-59","ge40","15-19","0-2","yes","2","left","central","yes","no-recurrence-events" +"50-59","premeno","10-14","0-2","no","3","left","left_low","no","no-recurrence-events" +"30-39","premeno","30-34","9-11","no","2","right","left_up","yes","recurrence-events" +"60-69","ge40","10-14","0-2","no","1","left","left_low","no","no-recurrence-events" +"40-49","premeno","40-44","0-2","no","2","right","left_low","no","no-recurrence-events" +"50-59","ge40","30-34","9-11",nan,"3","left","left_up","yes","no-recurrence-events" +"40-49","premeno","50-54","0-2","no","2","right","left_low","yes","recurrence-events" +"50-59","ge40","15-19","0-2","no","2","right","right_up","no","no-recurrence-events" +"50-59","ge40","40-44","3-5","yes","2","left","left_low","no","no-recurrence-events" +"30-39","premeno","25-29","3-5","yes","3","left","left_low","yes","recurrence-events" +"60-69","ge40","10-14","0-2","no","2","left","left_low","no","no-recurrence-events" +"60-69","lt40","10-14","0-2","no","1","left","right_up","no","no-recurrence-events" +"30-39","premeno","30-34","0-2","no","2","left","left_up","no","recurrence-events" +"30-39","premeno","20-24","3-5","yes","2","left","left_low","no","recurrence-events" +"50-59","ge40","10-14","0-2","no","1","right","left_up","no","no-recurrence-events" +"60-69","ge40","25-29","0-2","no","3","right","left_up","no","no-recurrence-events" 
+"50-59","ge40","25-29","3-5","yes","3","right","left_up","no","no-recurrence-events" +"40-49","premeno","30-34","6-8","no","2","left","left_up","no","no-recurrence-events" +"60-69","ge40","50-54","0-2","no","2","left","left_low","no","no-recurrence-events" +"50-59","premeno","30-34","0-2","no","3","left","left_low","no","no-recurrence-events" +"40-49","ge40","20-24","3-5","no","3","right","left_low","yes","recurrence-events" +"50-59","ge40","30-34","6-8","yes","2","left","right_low","yes","recurrence-events" +"60-69","ge40","25-29","3-5","no","2","right","right_up","no","recurrence-events" +"40-49","premeno","20-24","0-2","no","2","left","central","no","no-recurrence-events" +"40-49","premeno","20-24","0-2","no","2","left","left_up","no","no-recurrence-events" +"40-49","premeno","50-54","0-2","no","2","left","left_low","no","no-recurrence-events" +"50-59","ge40","20-24","0-2","no","2","right","central","no","recurrence-events" +"50-59","ge40","30-34","3-5","no","3","right","left_up","no","recurrence-events" +"40-49","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events" +"50-59","premeno","25-29","0-2","no","1","right","left_up","no","recurrence-events" +"40-49","premeno","40-44","3-5","yes","3","right","left_up","yes","no-recurrence-events" +"40-49","premeno","20-24","0-2","no","2","right","left_up","no","no-recurrence-events" +"40-49","premeno","20-24","3-5","no","2","right","left_up","no","no-recurrence-events" +"40-49","premeno","25-29","9-11","yes","3","right","left_up","no","recurrence-events" +"40-49","premeno","25-29","0-2","no","2","right","left_low","no","recurrence-events" +"40-49","premeno","20-24","0-2","no","1","right","right_up","no","no-recurrence-events" +"30-39","premeno","40-44","0-2","no","2","right","right_up","no","no-recurrence-events" +"60-69","ge40","10-14","6-8","yes","3","left","left_up","yes","recurrence-events" +"40-49","premeno","35-39","0-2","no","1","left","left_low","no","no-recurrence-events" +"50-59","ge40","30-34","3-5","no","3","left","left_low","no","recurrence-events" +"40-49","premeno","5-9","0-2","no","1","left","left_low","yes","no-recurrence-events" +"60-69","ge40","15-19","0-2","no","1","left","right_low","no","no-recurrence-events" +"40-49","premeno","30-34","0-2","no","3","right","right_up","no","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","3","left","left_up","no","recurrence-events" +"50-59","ge40","5-9","0-2","no","2","right","right_up","no","no-recurrence-events" +"50-59","premeno","25-29","0-2","no","2","right","right_low","no","no-recurrence-events" +"50-59","premeno","25-29","0-2","no","2","left","right_up","no","recurrence-events" +"40-49","premeno","10-14","0-2","no","2","left","left_low","yes","no-recurrence-events" +"60-69","ge40","35-39","6-8","yes","3","left","left_low","no","recurrence-events" +"60-69","ge40","50-54","0-2","no","2","right","left_up","yes","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","2","right","left_up","no","no-recurrence-events" +"30-39","premeno","20-24","3-5","no","2","right","central","no","no-recurrence-events" +"30-39","premeno","30-34","0-2","no","1","right","left_up","no","recurrence-events" +"60-69","lt40","30-34","0-2","no","1","left","left_low","no","no-recurrence-events" +"40-49","premeno","15-19","12-14","no","3","right","right_low","yes","no-recurrence-events" +"60-69","ge40","20-24","0-2","no","3","right","left_low","no","recurrence-events" +"30-39","premeno","5-9","0-2","no","2","left","right_low","no","no-recurrence-events" 
+"40-49","premeno","30-34","0-2","no","3","left","left_up","no","no-recurrence-events" +"60-69","ge40","30-34","0-2","no","3","left","left_low","no","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","1","right","right_low","no","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","1","left","right_low","no","no-recurrence-events" +"60-69","ge40","40-44","3-5","yes","3","right","left_low","no","recurrence-events" +"50-59","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events" +"50-59","premeno","30-34","0-2","no","3","right","left_up","yes","recurrence-events" +"40-49","ge40","30-34","3-5","no","3","left","left_low","no","recurrence-events" +"40-49","premeno","25-29","0-2","no","1","right","left_low","yes","no-recurrence-events" +"40-49","ge40","25-29","12-14","yes","3","left","right_low","yes","recurrence-events" +"40-49","premeno","40-44","0-2","no","1","left","left_low","no","recurrence-events" +"40-49","premeno","20-24","0-2","no","2","left","left_low","no","no-recurrence-events" +"50-59","ge40","25-29","0-2","no","1","left","right_low","no","no-recurrence-events" +"40-49","premeno","20-24","0-2","no","2","right","left_up","no","no-recurrence-events" +"70-79","ge40","40-44","0-2","no","1","right","left_up","no","no-recurrence-events" +"60-69","ge40","25-29","0-2","no","3","left","left_up","no","recurrence-events" +"50-59","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events" +"60-69","ge40","45-49","0-2","no","1","right","right_up","yes","recurrence-events" +"50-59","ge40","20-24","0-2","yes","2","right","left_up","no","no-recurrence-events" +"50-59","ge40","25-29","0-2","no","1","left","left_low","no","no-recurrence-events" +"50-59","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events" +"40-49","premeno","20-24","3-5","no","2","right","left_low","no","no-recurrence-events" +"50-59","ge40","35-39","0-2","no","2","left","left_up","no","no-recurrence-events" +"30-39","premeno","20-24","0-2","no","3","left","left_up","yes","recurrence-events" +"60-69","ge40","30-34","0-2","no","1","right","left_up","no","no-recurrence-events" +"60-69","ge40","25-29","0-2","no","3","right","left_low","no","no-recurrence-events" +"40-49","ge40","30-34","0-2","no","2","left","left_up","yes","no-recurrence-events" +"30-39","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events" +"40-49","premeno","20-24","0-2","no","2","left","left_low","no","recurrence-events" +"30-39","premeno","20-24","0-2","no","2","left","right_low","no","no-recurrence-events" +"40-49","premeno","10-14","0-2","no","2","right","left_low","no","no-recurrence-events" +"50-59","premeno","15-19","0-2","no","2","right","right_low","no","no-recurrence-events" +"50-59","premeno","25-29","0-2","no","1","right","left_up","no","no-recurrence-events" +"60-69","ge40","20-24","0-2","no","2","right","left_up","no","no-recurrence-events" +"60-69","ge40","40-44","0-2","no","2","right","left_low","no","recurrence-events" +"30-39","lt40","15-19","0-2","no","3","right","left_up","no","no-recurrence-events" +"40-49","premeno","30-34","12-14","yes","3","left","left_up","yes","recurrence-events" +"60-69","ge40","30-34","0-2","yes","2","right","right_up","yes","recurrence-events" +"50-59","ge40","40-44","6-8","yes","3","left","left_low","yes","recurrence-events" +"50-59","ge40","30-34","0-2","no","3","left",nan,"no","recurrence-events" +"70-79","ge40","10-14","0-2","no","2","left","central","no","no-recurrence-events" 
+"30-39","premeno","40-44","0-2","no","2","left","left_low","yes","no-recurrence-events" +"40-49","premeno","30-34","0-2","no","2","right","right_low","no","no-recurrence-events" +"40-49","premeno","30-34","0-2","no","1","left","left_low","no","no-recurrence-events" +"60-69","ge40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events" +"40-49","premeno","10-14","0-2","no","2","left","left_low","no","no-recurrence-events" +"60-69","ge40","20-24","0-2","no","1","left","left_low","no","no-recurrence-events" +"50-59","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events" +"50-59","premeno","25-29","0-2","no","1","left","left_low","no","no-recurrence-events" +"50-59","ge40","30-34","9-11","yes","3","left","right_low","yes","recurrence-events" +"50-59","ge40","10-14","0-2","no","2","left","left_low","no","no-recurrence-events" +"40-49","premeno","30-34","0-2","no","1","left","right_up","no","no-recurrence-events" +"70-79","ge40","0-4","0-2","no","1","left","right_low","no","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","3","right","left_up","yes","no-recurrence-events" +"50-59","premeno","25-29","0-2","no","3","right","left_low","yes","recurrence-events" +"50-59","ge40","40-44","0-2","no","2","left","left_low","no","no-recurrence-events" +"60-69","ge40","25-29","0-2","no","3","left","right_low","yes","recurrence-events" +"40-49","premeno","30-34","3-5","yes","2","right","left_low","no","no-recurrence-events" +"50-59","ge40","20-24","0-2","no","2","left","left_up","no","recurrence-events" +"70-79","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events" +"30-39","premeno","25-29","0-2","no","1","left","central","no","no-recurrence-events" +"60-69","ge40","30-34","0-2","no","2","left","left_low","no","no-recurrence-events" +"40-49","premeno","20-24","3-5","yes","2","right","right_up","yes","recurrence-events" +"50-59","ge40","30-34","9-11",nan,"3","left","left_low","yes","no-recurrence-events" +"50-59","ge40","0-4","0-2","no","2","left","central","no","no-recurrence-events" +"40-49","premeno","20-24","0-2","no","3","right","left_low","yes","no-recurrence-events" +"30-39","premeno","35-39","0-2","no","3","left","left_low","no","recurrence-events" +"60-69","ge40","30-34","0-2","no","1","left","left_up","no","no-recurrence-events" +"60-69","ge40","20-24","0-2","no","1","left","left_low","no","no-recurrence-events" +"50-59","ge40","25-29","6-8","no","3","left","left_low","yes","recurrence-events" +"50-59","premeno","35-39","15-17","yes","3","right","right_up","no","recurrence-events" +"30-39","premeno","20-24","3-5","yes","2","right","left_up","yes","no-recurrence-events" +"40-49","premeno","20-24","6-8","no","2","right","left_low","yes","no-recurrence-events" +"50-59","ge40","35-39","0-2","no","3","left","left_low","no","no-recurrence-events" +"50-59","premeno","35-39","0-2","no","2","right","left_up","no","no-recurrence-events" +"40-49","premeno","25-29","0-2","no","2","left","left_up","yes","no-recurrence-events" +"40-49","premeno","35-39","0-2","no","2","right","right_up","no","no-recurrence-events" +"50-59","premeno","30-34","3-5","yes","2","left","left_low","yes","no-recurrence-events" +"40-49","premeno","20-24","0-2","no","2","right","right_up","no","no-recurrence-events" +"60-69","ge40","15-19","0-2","no","3","right","left_up","yes","no-recurrence-events" +"50-59","ge40","30-34","6-8","yes","2","left","left_low","no","no-recurrence-events" +"50-59","premeno","25-29","3-5","yes","2","left","left_low","yes","no-recurrence-events" 
+"30-39","premeno","30-34","6-8","yes","2","right","right_up","no","no-recurrence-events" +"50-59","premeno","15-19","0-2","no","2","right","left_low","no","no-recurrence-events" +"50-59","ge40","40-44","0-2","no","3","left","right_up","no","no-recurrence-events" \ No newline at end of file diff --git a/one_exercise_per_file/week02/day03/ex05/data/breast_cancer_readme.txt b/one_exercise_per_file/week02/day03/ex05/data/breast_cancer_readme.txt new file mode 100644 index 0000000..ce7417c --- /dev/null +++ b/one_exercise_per_file/week02/day03/ex05/data/breast_cancer_readme.txt @@ -0,0 +1,73 @@ +Citation Request: + This breast cancer domain was obtained from the University Medical Centre, + Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and + M. Soklic for providing the data. Please include this citation if you plan + to use this database. + +1. Title: Breast cancer data (Michalski has used this) + +2. Sources: + -- Matjaz Zwitter & Milan Soklic (physicians) + Institute of Oncology + University Medical Center + Ljubljana, Yugoslavia + -- Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu) + -- Date: 11 July 1988 + +3. Past Usage: (Several: here are some) + -- Michalski,R.S., Mozetic,I., Hong,J., & Lavrac,N. (1986). The + Multi-Purpose Incremental Learning System AQ15 and its Testing + Application to Three Medical Domains. In Proceedings of the + Fifth National Conference on Artificial Intelligence, 1041-1045, + Philadelphia, PA: Morgan Kaufmann. + -- accuracy range: 66%-72% + -- Clark,P. & Niblett,T. (1987). Induction in Noisy Domains. In + Progress in Machine Learning (from the Proceedings of the 2nd + European Working Session on Learning), 11-30, Bled, + Yugoslavia: Sigma Press. + -- 8 test results given: 65%-72% accuracy range + -- Tan, M., & Eshelman, L. (1988). Using weighted networks to + represent classification knowledge in noisy domains. Proceedings + of the Fifth International Conference on Machine Learning, 121-134, + Ann Arbor, MI. + -- 4 systems tested: accuracy range was 68%-73.5% + -- Cestnik,G., Konenenko,I, & Bratko,I. (1987). Assistant-86: A + Knowledge-Elicitation Tool for Sophisticated Users. In I.Bratko + & N.Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press. + -- Assistant-86: 78% accuracy + +4. Relevant Information: + This is one of three domains provided by the Oncology Institute + that has repeatedly appeared in the machine learning literature. + (See also lymphography and primary-tumor.) + + This data set includes 201 instances of one class and 85 instances of + another class. The instances are described by 9 attributes, some of + which are linear and some are nominal. + +5. Number of Instances: 286 + +6. Number of Attributes: 9 + the class attribute + +7. Attribute Information: + 1. Class: no-recurrence-events, recurrence-events + 2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99. + 3. menopause: lt40, ge40, premeno. + 4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, + 45-49, 50-54, 55-59. + 5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, + 27-29, 30-32, 33-35, 36-39. + 6. node-caps: yes, no. + 7. deg-malig: 1, 2, 3. + 8. breast: left, right. + 9. breast-quad: left-up, left-low, right-up, right-low, central. + 10. irradiat: yes, no. + +8. Missing Attribute Values: (denoted by "?") + Attribute #: Number of instances with missing values: + 6. 8 + 9. 1. + +9. Class Distribution: + 1. no-recurrence-events: 201 instances + 2. 
recurrence-events: 85 instances \ No newline at end of file diff --git a/one_exercise_per_file/week02/day04/audit/readme.md b/one_exercise_per_file/week02/day04/audit/readme.md deleted file mode 100644 index e69de29..0000000 diff --git a/one_exercise_per_file/week02/day04/ex01/audit/readme.md b/one_exercise_per_file/week02/day04/ex01/audit/readme.md new file mode 100644 index 0000000..4b56476 --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex01/audit/readme.md @@ -0,0 +1 @@ +1. This question is validated if the MSE outputted is **2.25**. \ No newline at end of file diff --git a/one_exercise_per_file/week02/day04/ex01/readme.md b/one_exercise_per_file/week02/day04/ex01/readme.md new file mode 100644 index 0000000..577ec77 --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex01/readme.md @@ -0,0 +1,10 @@ +# Exercise 1 MSE Scikit-learn + +The goal of this exercise is to learn to use `sklearn.metrics` to compute the mean squared error (MSE). + +1. Compute the MSE using `sklearn.metrics` on `y_true` and `y_pred` below: + +```python +y_true = [91, 51, 2.5, 2, -5] +y_pred = [90, 48, 2, 2, -4] +``` \ No newline at end of file diff --git a/one_exercise_per_file/week02/day04/ex02/audit/readme.md b/one_exercise_per_file/week02/day04/ex02/audit/readme.md new file mode 100644 index 0000000..3ce30e6 --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex02/audit/readme.md @@ -0,0 +1 @@ +1. This question is validated if the accuracy outputted is **0.5714285714285714**. diff --git a/one_exercise_per_file/week02/day04/ex02/readme.md b/one_exercise_per_file/week02/day04/ex02/readme.md new file mode 100644 index 0000000..f5d5ca5 --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex02/readme.md @@ -0,0 +1,10 @@ +# Exercise 2 Accuracy Scikit-learn + +The goal of this exercise is to learn to use `sklearn.metrics` to compute the accuracy. + +1. Compute the accuracy using `sklearn.metrics` on `y_true` and `y_pred` below: + +```python +y_pred = [0, 1, 0, 1, 0, 1, 0] +y_true = [0, 0, 1, 1, 1, 1, 0] +``` \ No newline at end of file diff --git a/one_exercise_per_file/week02/day04/ex03/audit/readme.md b/one_exercise_per_file/week02/day04/ex03/audit/readme.md new file mode 100644 index 0000000..10bfac2 --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex03/audit/readme.md @@ -0,0 +1,28 @@ +1. This question is validated if the predictions on the train set and test set are: + + ```console + # 10 first values Train + array([1.54505951, 2.21338527, 2.2636205 , 3.3258957 , 1.51710076, + 1.63209319, 2.9265211 , 0.78080924, 1.21968217, 0.72656239]) + ``` + + ```console + #10 first values Test + + array([ 1.82212706, 1.98357668, 0.80547979, -0.19259114, 1.76072418, + 3.27855815, 2.12056804, 1.96099917, 2.38239663, 1.21005304]) + ``` + +2. This question is validated if the results match this output: + + ```console + r2 on the train set: 0.3552292936915783 + MAE on the train set: 0.5300159371615256 + MSE on the train set: 0.5210784446797679 + + r2 on the test set: 0.30265471284464673 + MAE on the test set: 0.5454023699809112 + MSE on the test set: 0.5537420654727396 + ``` + +This result shows that the model has slightly better results on the train set than the test set. That's frequent since it is easier to get a better grade on an exam we studied than an exam that is different from what was prepared. However, the results are not good: r2 ~ 0.3. Fitting non linear models as the Random Forest on this data may improve the results. That's the goal of the exercise 5. 
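+
+For reference, here is a minimal sketch of how these metrics might be computed (assuming the `pipe`, `X_train`, `X_test`, `y_train` and `y_test` objects defined in the exercise statement):
+
+```python
+from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
+
+# predictions on both sets with the fitted pipeline
+y_train_pred = pipe.predict(X_train)
+y_test_pred = pipe.predict(X_test)
+
+for name, y_true, y_pred in [("train", y_train, y_train_pred),
+                             ("test", y_test, y_test_pred)]:
+    print(f"r2 on the {name} set:", r2_score(y_true, y_pred))
+    print(f"MAE on the {name} set:", mean_absolute_error(y_true, y_pred))
+    print(f"MSE on the {name} set:", mean_squared_error(y_true, y_pred))
+```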
\ No newline at end of file diff --git a/one_exercise_per_file/week02/day04/ex03/readme.md b/one_exercise_per_file/week02/day04/ex03/readme.md new file mode 100644 index 0000000..e7f3a4b --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex03/readme.md @@ -0,0 +1,37 @@ +# Exercise 3 Regression + +The goal of this exercise is to learn to evaluate a machine learning model using many regression metrics. + +Preliminary: + +- Import California Housing data set and split it in a train set and a test set (10%). Fit a linear regression on the data set. *The goal is focus on the metrics, that is why the code to fit the Linear Regression is given.* + +```python +# imports +from sklearn.datasets import fetch_california_housing +from sklearn.model_selection import train_test_split +from sklearn.linear_model import LinearRegression +from sklearn.preprocessing import StandardScaler +from sklearn.impute import SimpleImputer +from sklearn.pipeline import Pipeline +# data +housing = fetch_california_housing() +X, y = housing['data'], housing['target'] +# split data train test +X_train, X_test, y_train, y_test = train_test_split(X, + y, + test_size=0.1, + shuffle=True, + random_state=13) +# pipeline +pipeline = [('imputer', SimpleImputer(strategy='median')), + ('scaler', StandardScaler()), + ('lr', LinearRegression())] +pipe = Pipeline(pipeline) +# fit +pipe.fit(X_train, y_train) +``` + +1. Predict on the train set and test set + +2. Compute R2, Mean Square Error, Mean Absolute Error on both train and test set \ No newline at end of file diff --git a/one_exercise_per_file/week02/day04/ex04/audit/readme.md b/one_exercise_per_file/week02/day04/ex04/audit/readme.md new file mode 100644 index 0000000..5dd7cd2 --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex04/audit/readme.md @@ -0,0 +1,41 @@ +1. This question is validated if the predictions on the train set and test set are: + + ```console + # 10 first values Train + array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0]) + + # 10 first values Test + array([1, 1, 0, 0, 0, 1, 1, 1, 0, 0]) + ``` + +2. This question is validated if the results match this output: + + ```console + F1 on the train set: 0.9911504424778761 + Accuracy on the train set: 0.989010989010989 + Recall on the train set: 0.9929078014184397 + Precision on the train set: 0.9893992932862191 + ROC_AUC on the train set: 0.9990161111794368 + + + F1 on the test set: 0.9801324503311258 + Accuracy on the test set: 0.9736842105263158 + Recall on the test set: 0.9866666666666667 + Precision on the test set: 0.9736842105263158 + ROC_AUC on the test set: 0.9863247863247864 + ``` + + The confusion matrix on the test set should be: + + ```console + array([[37, 2], + [ 1, 74]]) + ``` + +3. The ROC AUC plot should look like: + +![alt text][logo_ex4] + +[logo_ex4]: ../images/w2_day4_ex4_q3.png "ROC AUC " + +Having a 99% ROC AUC is not usual. The data set we used is easy to classify. On real data sets, always check if there's any leakage while having such a high ROC AUC score. 
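+
+For reference, a minimal sketch of how these metrics might be computed (assuming the `classifier`, `scaler` and train/test split from the exercise statement; the test set has to be scaled with the scaler fitted on the train set):
+
+```python
+from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
+                             precision_score, recall_score, roc_auc_score)
+
+# scale the test set with the scaler fitted on the train set
+X_test_scaled = scaler.transform(X_test)
+
+y_test_pred = classifier.predict(X_test_scaled)
+# ROC AUC is computed on probabilities, not on predicted classes
+y_test_proba = classifier.predict_proba(X_test_scaled)[:, 1]
+
+print("F1 on the test set:", f1_score(y_test, y_test_pred))
+print("Accuracy on the test set:", accuracy_score(y_test, y_test_pred))
+print("Recall on the test set:", recall_score(y_test, y_test_pred))
+print("Precision on the test set:", precision_score(y_test, y_test_pred))
+print("ROC_AUC on the test set:", roc_auc_score(y_test, y_test_proba))
+print(confusion_matrix(y_test, y_test_pred))
+```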
diff --git a/one_exercise_per_file/week02/day04/ex04/images/w2_day4_ex4_q3.png b/one_exercise_per_file/week02/day04/ex04/images/w2_day4_ex4_q3.png
new file mode 100644
index 0000000..1f79eb7
Binary files /dev/null and b/one_exercise_per_file/week02/day04/ex04/images/w2_day4_ex4_q3.png differ
diff --git a/one_exercise_per_file/week02/day04/ex04/readme.md b/one_exercise_per_file/week02/day04/ex04/readme.md
new file mode 100644
index 0000000..0248d2b
--- /dev/null
+++ b/one_exercise_per_file/week02/day04/ex04/readme.md
@@ -0,0 +1,36 @@
+# Exercise 4 Classification
+
+The goal of this exercise is to learn to evaluate a machine learning model using many classification metrics.
+
+Preliminary:
+
+- Import Breast Cancer data set and split it in a train set and a test set (20%). Fit a logistic regression on the data set. *The goal is to focus on the metrics, that is why the code to fit the Logistic Regression is given.*
+
+```python
+from sklearn.linear_model import LogisticRegression
+from sklearn.datasets import load_breast_cancer
+from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import StandardScaler
+
+X, y = load_breast_cancer(return_X_y=True)
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.20)
+scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)
+classifier = LogisticRegression()
+classifier.fit(X_train_scaled, y_train)
+```
+
+1. Predict on the train set and test set
+
+2. Compute F1, accuracy, precision, recall, roc_auc scores on the train set and test set. Print the confusion matrix on the test set results.
+
+**Note: AUC can only be computed on probabilities, not on classes.**
+
+3. Plot the ROC curve on the test set using `roc_curve` from scikit-learn. There are many ways to create this plot. It should look like this:
+
+![alt text][logo_ex4]
+
+[logo_ex4]: images/w2_day4_ex4_q3.png "ROC AUC "
+
+- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html
\ No newline at end of file
diff --git a/one_exercise_per_file/week02/day04/ex05/audit/readme.md b/one_exercise_per_file/week02/day04/ex05/audit/readme.md
new file mode 100644
index 0000000..e3bfa1c
--- /dev/null
+++ b/one_exercise_per_file/week02/day04/ex05/audit/readme.md
@@ -0,0 +1,72 @@
+1. Some of the algorithms use random steps (random sampling used by the `RandomForest`). I used `random_state = 43` for the Random Forest, the Decision Tree and the Gradient Boosting.
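+
+For reference, a minimal sketch of how the five pipelines might be built (the estimator classes are the standard scikit-learn regressors; the imputer and scaler steps are assumed to be the same as in the exercise statement):
+
+```python
+from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
+from sklearn.impute import SimpleImputer
+from sklearn.linear_model import LinearRegression
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.svm import SVR
+from sklearn.tree import DecisionTreeRegressor
+
+models = {"Linear Regression": LinearRegression(),
+          "SVM": SVR(),
+          "Decision Tree": DecisionTreeRegressor(random_state=43),
+          "Random Forest": RandomForestRegressor(random_state=43),
+          "Gradient Boosting": GradientBoostingRegressor(random_state=43)}
+
+# one pipeline per model, keeping the imputer and scaler unchanged
+pipes = {name: Pipeline([("imputer", SimpleImputer(strategy="median")),
+                         ("scaler", StandardScaler()),
+                         ("model", model)])
+         for name, model in models.items()}
+```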
This question is validated of the scores you got are close to: + + ```console + # Linear regression + + TRAIN + r2 on the train set: 0.34823544284172625 + MAE on the train set: 0.533092001261455 + MSE on the train set: 0.5273648371379568 + + TEST + r2 on the test set: 0.3551785428138914 + MAE on the test set: 0.5196420310323713 + MSE on the test set: 0.49761195027083804 + + + # SVM + + TRAIN + r2 on the train set: 0.6462366150965996 + MAE on the train set: 0.38356451633259875 + MSE on the train set: 0.33464478671339165 + + TEST + r2 on the test set: 0.6162644671183826 + MAE on the test set: 0.3897680598426786 + MSE on the test set: 0.3477101776543003 + + + # Decision Tree + + TRAIN + r2 on the train set: 0.9999999999999488 + MAE on the train set: 1.3685733933909677e-08 + MSE on the train set: 6.842866883530944e-14 + + TEST + r2 on the test set: 0.6263651902480918 + MAE on the test set: 0.4383758696244002 + MSE on the test set: 0.4727017198871596 + + + # Random Forest + + TRAIN + r2 on the train set: 0.9705418471542886 + MAE on the train set: 0.11983836612191189 + MSE on the train set: 0.034538356420577995 + + TEST + r2 on the test set: 0.7504673649554309 + MAE on the test set: 0.31889891600404635 + MSE on the test set: 0.24096164834441108 + + + # Gradient Boosting + + TRAIN + r2 on the train set: 0.7395782392433273 + MAE on the train set: 0.35656543036682264 + MSE on the train set: 0.26167490389525294 + + TEST + r2 on the test set: 0.7157456298013534 + MAE on the test set: 0.36455447680396397 + MSE on the test set: 0.27058170064218096 + + ``` + +It is important to notice that the Decision Tree over fits very easily. It learns easily the training data but is not able to extrapolate on the test set. This algorithm is not used a lot. + +However, Random Forest and Gradient Boosting propose a solid approach to correct the over fitting (in that case the parameters `max_depth` is set to None that is why the Random Forest over fits the data). These two algorithms are used intensively in Machine Learning Projects. diff --git a/one_exercise_per_file/week02/day04/ex05/readme.md b/one_exercise_per_file/week02/day04/ex05/readme.md new file mode 100644 index 0000000..8e4df06 --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex05/readme.md @@ -0,0 +1,55 @@ +# Exercise 5 Machine Learning models + +The goal of this exercise is to have an overview of the existing Machine Learning models and to learn to call them from scikit learn. +We will focus on: + +- SVM/SVC +- Decision Tree +- Random Forest (Ensemble learning) +- Gradient Boosting (Ensemble learning, Boosting techniques) + +All these algorithms exist in two versions: regression and classification. Even if the logic is similar in both classification and regression, the loss function is specific to each case. + +It is really easy to get lost among all the existing algorithms. This article is very useful to have a clear overview of the models and to understand which algorithm use and when. https://towardsdatascience.com/how-to-choose-the-right-machine-learning-algorithm-for-your-application-1e36c32400b9 + +Preliminary: + +- Import California Housing data set and split it in a train set and a test set (10%). Fit a linear regression on the data set. 
*The goal is to focus on the metrics, that is why the code to fit the Linear Regression is given.* + +```python +# imports +from sklearn.datasets import fetch_california_housing +from sklearn.model_selection import train_test_split +from sklearn.linear_model import LinearRegression +from sklearn.preprocessing import StandardScaler +from sklearn.impute import SimpleImputer +from sklearn.pipeline import Pipeline +# data +housing = fetch_california_housing() +X, y = housing['data'], housing['target'] +# split data train test +X_train, X_test, y_train, y_test = train_test_split(X, + y, + test_size=0.1, + shuffle=True, + random_state=43) +# pipeline +pipeline = [('imputer', SimpleImputer(strategy='median')), + ('scaler', StandardScaler()), + ('lr', LinearRegression())] +pipe = Pipeline(pipeline) +# fit +pipe.fit(X_train, y_train) + +``` + +1. Create 5 pipelines with 5 different models as final estimator (keep the imputer and scaler unchanged): + 1. Linear Regression + 2. SVM + 3. Decision Tree (set `random_state=43`) + 4. Random Forest (set `random_state=43`) + 5. Gradient Boosting (set `random_state=43`) + +Take time to have basic understanding of the role of the basic hyperparameter and their default value. + +- For each algorithm, print the R2, MSE and MAE on both train set and test set. \ No newline at end of file diff --git a/one_exercise_per_file/week02/day04/ex06/audit/readme.md b/one_exercise_per_file/week02/day04/ex06/audit/readme.md new file mode 100644 index 0000000..2244157 --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex06/audit/readme.md @@ -0,0 +1,31 @@ +1. This question is validated if the code that runs the `gridsearch` is (the parameters may change): + +```python +parameters = {'n_estimators':[10, 50, 75], + 'max_depth':[3,5,7], + 'min_samples_leaf': [10,20,30]} + +rf = RandomForestRegressor() +gridsearch = GridSearchCV(rf, + parameters, + cv = [(np.arange(18576), np.arange(18576,20640))], + n_jobs=-1) +gridsearch.fit(X, y) +``` + +2. This question is validated if the function is: + +```python +def select_model_verbose(gs): + + return gs.best_estimator_, gs.best_params_, gs.best_score_ +``` + +In my case, the `gridsearch` parameters are not interesting. Even if I reduced the over fitting of the Random Forest, the score on the test is lower than the score on the test returned by the Gradient Boosting in the previous exercise without optimal parameters search. + +3. This question is validated if the code used is: + +```python +model, best_params, best_score = select_model_verbose(gridsearch) +model.predict(new_point) +``` diff --git a/one_exercise_per_file/week02/day04/ex06/readme.md b/one_exercise_per_file/week02/day04/ex06/readme.md new file mode 100644 index 0000000..090605e --- /dev/null +++ b/one_exercise_per_file/week02/day04/ex06/readme.md @@ -0,0 +1,51 @@ +# Exercise 6 Grid Search + +The goal of this exercise is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameter which are the parameters of the model impact the performance of the model. + +The scikit learn object that runs the Grid Search is called GridSearchCV. We will learn tomorrow about the cross validation. For now, let us set the parameter **cv** to `[(np.arange(18576), np.arange(18576,20640))]`. +This means that GridSearchCV splits the data set in a train and test set. + +Preliminary: + +- Load the California Housing data set. 
As mentioned, this time there's no need to split the data set into a train set and a test set since GridSearchCV does it.
+
+You will have to run a Grid Search on the Random Forest on at least the hyperparameters that are mentioned below. It doesn't mean these are the only hyperparameters of the model. If possible, try at least 3 different values for each hyperparameter.
+
+1. Run a Grid Search with `n_jobs` set to `-1` to parallelize the computations on all CPUs. The hyperparameters to tune are: n_estimators, max_depth, min_samples_leaf. It may take a few minutes to run.
+
+Now, let us analyse the grid search's results in order to select the best model.
+
+2. Write a function that takes as input the Grid Search object and that returns the best model **fitted**, the best set of hyperparameters and the associated score:
+
+    ```python
+    def select_model_verbose(gs):
+
+        return trained_model, best_params, best_score
+    ```
+
+3. Use the trained model to predict on a new point:
+
+```python
+new_point = np.array([[3.2031, 52., 5.47761194, 1.07960199, 910., 2.26368159, 37.85, -122.26]])
+```
+
+How do we know the best model returned by GridSearchCV is good enough and stable? That is what we will learn tomorrow!
+
+**WARNING: Some combinations of hyperparameters are not possible. For example, with the SVM, the linear kernel has no parameter gamma.**
+
+**Note**:
+
+- GridSearchCV can also take a Pipeline instead of a Machine Learning model. It is useful to combine some Imputers or Dimension reduction techniques with some Machine Learning models in the same Pipeline.
+- It may be useful to check on Kaggle if some Kagglers share their Grid Searches.
+
+Resources:
+
+- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
+
+- https://stackoverflow.com/questions/38555650/try-multiple-estimator-in-one-grid-search
+
+- https://medium.com/fintechexplained/what-is-grid-search-c01fe886ef0a
+
+- https://elutins.medium.com/grid-searching-in-machine-learning-quick-explanation-and-python-implementation-550552200596
+
+- https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
\ No newline at end of file
diff --git a/one_exercise_per_file/week02/day04/readme.md b/one_exercise_per_file/week02/day04/readme.md
index e69de29..830e5b2 100644
--- a/one_exercise_per_file/week02/day04/readme.md
+++ b/one_exercise_per_file/week02/day04/readme.md
@@ -0,0 +1,34 @@
+# D04 Piscine AI - Data Science
+
+# Table of Contents:
+
+# Introduction
+
+Today we will learn how to choose the right Machine Learning metric depending on the problem you are solving and to compute it. A metric gives an idea of how good the model performs. Depending on whether you are working on a classification problem or a regression problem, the metrics considered are different. It is important to understand that all metrics are just metrics, not the truth.
+
+We will focus on the most important metrics:
+
+- Regression:
+  - **R2**, **Mean Square Error**, **Mean Absolute Error**
+- Classification:
+  - **F1 score**, **accuracy**, **precision**, **recall** and **AUC scores**. Even if it is not considered as a metric, the **confusion matrix** is always useful to understand the model performance.
+
+Warning: **Imbalanced data set**
+
+Let us assume we are predicting a rare event that occurs less than 2% of the time. Having a model that scores a good accuracy is easy, it doesn't have to be "smart", all it has to do is to always predict the majority class. Depending on the problem it can be disastrous.
For example, working with real life data, breast cancer prediction is an imbalanced problem where predicting the majority leads to disastrous consequences. That is why metrics as AUC are useful. + +- https://stats.stackexchange.com/questions/260164/auc-and-class-imbalance-in-training-test-dataset + +Before to compute the metrics, read carefully this article to understand the role of these metrics. + +- https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-models-2.html + ++ ML models + GS + +## Historical + +## Rules + +## Resources + +- https://scikit-learn.org/stable/modules/model_evaluation.html diff --git a/one_exercise_per_file/week02/day05/audit/readme.md b/one_exercise_per_file/week02/day05/audit/readme.md deleted file mode 100644 index e69de29..0000000 diff --git a/one_exercise_per_file/week02/day05/ex01/audit/readme.md b/one_exercise_per_file/week02/day05/ex01/audit/readme.md new file mode 100644 index 0000000..3b945de --- /dev/null +++ b/one_exercise_per_file/week02/day05/ex01/audit/readme.md @@ -0,0 +1,18 @@ +1. This question is validated if the output of the 5-fold cross validation is: + + ```console + Fold: 1 + TRAIN: [2 3 4 5 6 7 8 9] TEST: [0 1] + + Fold: 2 + TRAIN: [0 1 4 5 6 7 8 9] TEST: [2 3] + + Fold: 3 + TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5] + + Fold: 4 + TRAIN: [0 1 2 3 4 5 8 9] TEST: [6 7] + + Fold: 5 + TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9] + ``` \ No newline at end of file diff --git a/one_exercise_per_file/week02/day05/ex01/readme.md b/one_exercise_per_file/week02/day05/ex01/readme.md new file mode 100644 index 0000000..b8241fc --- /dev/null +++ b/one_exercise_per_file/week02/day05/ex01/readme.md @@ -0,0 +1,27 @@ +# Exercise 1: K-Fold + +The goal of this exercise is to learn to use `KFold` to split the data set in a k-fold cross validation. Most of the time you won't use this function to split your data because this function is used by others as `cross_val_score` or `cross_validate` or `GridSearchCV` ... . But, this allows to understand the splitting and to create a custom one if needed. + +```python +X = np.array(np.arange(1,21).reshape(10,-1)) +y = np.array(np.arange(1,11)) +``` + +1. Using `KFold`, perform a 5-fold cross validation. For each fold, print the train index and test index. The expected output is: + + ```console + Fold: 1 + TRAIN: [2 3 4 5 6 7 8 9] TEST: [0 1] + + Fold: 2 + TRAIN: [0 1 4 5 6 7 8 9] TEST: [2 3] + + Fold: 3 + TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5] + + Fold: 4 + TRAIN: [0 1 2 3 4 5 8 9] TEST: [6 7] + + Fold: 5 + TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9] + ``` \ No newline at end of file diff --git a/one_exercise_per_file/week02/day05/ex02/audit/readme.md b/one_exercise_per_file/week02/day05/ex02/audit/readme.md new file mode 100644 index 0000000..d1ca44d --- /dev/null +++ b/one_exercise_per_file/week02/day05/ex02/audit/readme.md @@ -0,0 +1,16 @@ +1. This question is validated if the output is: + +```console +Scores on validation sets: + [0.62433594 0.61648956 0.62486602 0.59891024 0.59284295 0.61307055 + 0.54630341 0.60742976 0.60014575 0.59574508] + +Mean of scores on validation sets: + 0.60201392526743 + +Standard deviation of scores on validation sets: + 0.0214983822773466 + +``` + +The model is consistent across folds: it is stable. That's a first sign that the model is not over fitted. The average R2 is 60% that's a good start ! To be improved. 
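+
+For reference, a minimal sketch of how this output might be produced (assuming the `pipe`, `X_train` and `y_train` objects defined in the exercise statement):
+
+```python
+from sklearn.model_selection import cross_validate
+
+# 10-fold cross validation of the pipeline on the train set
+cv_results = cross_validate(pipe, X_train, y_train, cv=10)
+scores = cv_results["test_score"]
+
+print("Scores on validation sets:\n", scores)
+print("Mean of scores on validation sets:\n", scores.mean())
+print("Standard deviation of scores on validation sets:\n", scores.std())
+```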
diff --git a/one_exercise_per_file/week02/day05/ex02/readme.md b/one_exercise_per_file/week02/day05/ex02/readme.md new file mode 100644 index 0000000..09edfaf --- /dev/null +++ b/one_exercise_per_file/week02/day05/ex02/readme.md @@ -0,0 +1,53 @@ +# Exercise 2: Cross validation (k-fold) + +The goal of this exercise is to learn how to use cross validation. After reading the articles you should be able to explain why we need to cross-validate the models. We will firstly focus on Linear Regression to reduce the computation time. We will be using `cross_validate` to run the cross validation. Note that `cross_val_score` is similar but the `cross_validate` calculates one or more scores and timings for each CV split. + +Preliminary: + +- Import California Housing data set and split it in a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the cross validation, that is why the code to fit the Linear Regression is given.* + +```python +# imports +from sklearn.datasets import fetch_california_housing +from sklearn.model_selection import train_test_split +from sklearn.linear_model import LinearRegression +from sklearn.preprocessing import StandardScaler +from sklearn.impute import SimpleImputer +from sklearn.pipeline import Pipeline + +# data +housing = fetch_california_housing() +X, y = housing['data'], housing['target'] +# split data train test +X_train, X_test, y_train, y_test = train_test_split(X, + y, + test_size=0.1, + shuffle=True, + random_state=43) +# pipeline +pipeline = [('imputer', SimpleImputer(strategy='median')), + ('scaler', StandardScaler()), + ('lr', LinearRegression())] +pipe = Pipeline(pipeline) +``` + +1. Cross validate the Pipeline using `cross_validate` with 10 folds. Print the scores on each validation sets, the mean score on the validation sets and the standard deviation on the validation sets. The expected output is: + +```console +Scores on validation sets: + [0.62433594 0.61648956 0.62486602 0.59891024 0.59284295 0.61307055 + 0.54630341 0.60742976 0.60014575 0.59574508] + +Mean of scores on validation sets: + 0.60201392526743 + +Standard deviation of scores on validation sets: + 0.0214983822773466 + + ``` + +**Note: It may be confusing that the key of the dictionary that returns the results on the validation sets is `test_score`. Sometimes, the validation sets are called test sets. In that case, we run the cross validation on X_train. It means that the scores are computed on sets in the initial train set. The X_test is not used for the cross-validation.** + +- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html + +- https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/ \ No newline at end of file diff --git a/one_exercise_per_file/week02/day05/ex03/audit/readme.md b/one_exercise_per_file/week02/day05/ex03/audit/readme.md new file mode 100644 index 0000000..ff20dc2 --- /dev/null +++ b/one_exercise_per_file/week02/day05/ex03/audit/readme.md @@ -0,0 +1,33 @@ +1. This question is validated if the code that runs the grid search is similar to: + +```python +parameters = {'n_estimators':[10, 50, 75], + 'max_depth':[4, 7, 10]} + +rf = RandomForestRegressor() +gridsearch = GridSearchCV(rf, + parameters, + cv = 5, + n_jobs=-1, + scoring='neg_mean_squared_error') + +gridsearch.fit(X_train, y_train) +``` + +The answers that uses another list of parameters are accepted too ! + +2. 
This question is validated if you called these attributes:
+
+```python
+print(gridsearch.best_score_)
+print(gridsearch.best_params_)
+print(gridsearch.cv_results_)
+```
+
+The best score is -0.29028202683007526, which means that the MSE is ~0.29. Taken alone, this value doesn't give much information since the MSE scale depends on the target. This score is the average of `neg_mean_squared_error` on all the validation sets.
+
+The best model's params are `{'max_depth': 10, 'n_estimators': 75}`.
+
+As you may have a different parameter list than this one, you may get different results.
+
+3. This question is validated if you used the fitted estimator to compute the score on the test set: `gridsearch.score(X_test, y_test)`. The MSE score is ~0.27. The score I got on the test set is close to the score I got on the validation sets. It means the model is not overfitted.
\ No newline at end of file
diff --git a/one_exercise_per_file/week02/day05/ex03/readme.md b/one_exercise_per_file/week02/day05/ex03/readme.md
new file mode 100644
index 0000000..a1f3308
--- /dev/null
+++ b/one_exercise_per_file/week02/day05/ex03/readme.md
@@ -0,0 +1,49 @@
+# Exercise 3 GridSearchCV
+
+The goal of this exercise is to learn to use GridSearchCV to run a grid search, predict on the test set and score on the test set.
+
+Preliminary:
+
+- Import California Housing data set and split it in a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the gridsearch, that is why the code to fit the Linear Regression is given.*
+
+```python
+# imports
+from sklearn.datasets import fetch_california_housing
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LinearRegression
+from sklearn.preprocessing import StandardScaler
+from sklearn.impute import SimpleImputer
+from sklearn.pipeline import Pipeline
+
+# data
+housing = fetch_california_housing()
+X, y = housing['data'], housing['target']
+# split data train test
+X_train, X_test, y_train, y_test = train_test_split(X,
+                                                    y,
+                                                    test_size=0.1,
+                                                    shuffle=True,
+                                                    random_state=43)
+# pipeline
+pipeline = [('imputer', SimpleImputer(strategy='median')),
+            ('scaler', StandardScaler()),
+            ('lr', LinearRegression())]
+pipe = Pipeline(pipeline)
+```
+
+1. Run `GridSearchCV` on all CPUs with 5 folds, MSE as score, Random Forest as model with:
+
+- max_depth between 1 and 20 (at least 3 values)
+- n_estimators between 1 and 100 (at least 3 values)
+
+This may take a few minutes to run.
+
+*Hint*: The name of the metric to put in the parameter `scoring` is `neg_mean_squared_error`. The smaller the MSE is, the better the model is. On the contrary, the greater the R2 is, the better the model is. `GridSearchCV` chooses the best model by selecting the one that maximizes the score on the validation sets. And, in mathematics, maximizing a function is equivalent to minimizing its opposite. More details:
+
+- https://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error
+
+2. Extract the best fitted estimator, print its params, print its score on the validation set and print `cv_results_`.
+
+3. Compute the score on the test set.
+
+**WARNING: If the score used in classification is the AUC, there is one rare case where the AUC may return an error or a warning: The fold contains only one class.
In that case it can't be computed, by definition.** \ No newline at end of file diff --git a/one_exercise_per_file/week02/day05/ex04/audit/readme.md b/one_exercise_per_file/week02/day05/ex04/audit/readme.md new file mode 100644 index 0000000..571864d --- /dev/null +++ b/one_exercise_per_file/week02/day05/ex04/audit/readme.md @@ -0,0 +1,27 @@ +1. This question is validated if the outputted plot looks like: + +![alt text][logo_ex5q1] + +[logo_ex5q1]: ../images/w2_day5_ex5_q1.png "Validation curve " + +The code that generated the data in the plot is: + +```python +from sklearn.model_selection import validation_curve + +clf = RandomForestClassifier() +param_range = np.arange(1,30,2) +train_scores, test_scores = validation_curve(clf, + X, + y, + param_name="max_depth", + param_range=param_range, + scoring="roc_auc", + n_jobs=-1) +``` + +2. This question is validated if the output is + +![alt text][logo_ex5q2] + +[logo_ex5q2]: ../images/w2_day5_ex5_q2.png "Learning curve " diff --git a/one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q1.png b/one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q1.png new file mode 100644 index 0000000..54e9f3c Binary files /dev/null and b/one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q1.png differ diff --git a/one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q2.png b/one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q2.png new file mode 100644 index 0000000..133663c Binary files /dev/null and b/one_exercise_per_file/week02/day05/ex04/images/w2_day5_ex5_q2.png differ diff --git a/one_exercise_per_file/week02/day05/ex04/readme.md b/one_exercise_per_file/week02/day05/ex04/readme.md new file mode 100644 index 0000000..4bf93b5 --- /dev/null +++ b/one_exercise_per_file/week02/day05/ex04/readme.md @@ -0,0 +1,58 @@ +# Exercise 4 Validation curve and Learning curve + +The goal of this exercise is to learn to analyse the model's performance with two tools: + +- Validation curve +- Learning curve + +For this exercise we will use a dataset of 100k data points to give you an idea of the computation time you can expect during projects. + +Preliminary: + +- Using make_classification from sklearn, generate a binary data set with 100k data points and with 30 features. + +```python +X, y = make_classification(n_samples=100000, + n_features= 30, + n_informative=10, + flip_y=0.2 ) +``` + +1. Plot the validation curve, using all CPUs, with 5 folds. The goal is to focus again on max_depth between 1 and 20. +You may need to increase the window (example: between 1 and 50 ) if you notice that other values of max_depth could have returned better results. This may take few minutes. + +I do not expect that you implement all the plot from scratch, you'd better leverage the code here: + +- https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve + +The plot should look like this: + +![alt text][logo_ex5q1] + +[logo_ex5q1]: images/w2_day5_ex5_q1.png "Validation curve " + +The interpretation is that from max_depth=10, the train score keeps increasing but the test score (or validation score) reaches a plateau. It means that choosing max_depth = 20 may lead to have an over fitted model. + +Note: Given the time computation is is not possible to plot the validation curve for all parameters. It is useful to plot it for parameters that control the over fitting the most. + +More details: + +- https://chrisalbon.com/machine_learning/model_evaluation/plot_the_validation_curve/ + +2. 
Let us assume the gridsearch returned `clf = RandomForestClassifier(max_depth=12)`. Let's check if the model underfits, overfits or fits correctly. Plot the learning curve. These two resources will help you a lot to understand how to analyse the learning curves and how to plot them:
+
+- https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
+
+- https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py
+
+- **Re-use the function in the second resource**, change the cross validation to a classic 10-folds, run the learning curve data computation on all CPUs and plot the three plots as shown below.
+
+![alt text][logo_ex5q2]
+
+[logo_ex5q2]: images/w2_day5_ex5_q2.png "Learning curve "
+
+- **Note Plot Learning Curves**: The learning curves are detailed in the first resource.
+
+- **Note Plot Scalability of the model**: This plot shows the relationship between the time to train the model and the number of rows in the data. In that case the relationship is linear.
+
+- **Note Performance of the model**: This plot shows whether it is worth increasing the training time by adding data to increase the score. It would be worth adding data to increase the score if the curve hasn't reached a plateau yet. In that case, increasing the training time by 10 units increases the score by less than 0.001.
\ No newline at end of file
diff --git a/one_exercise_per_file/week02/day05/readme.md b/one_exercise_per_file/week02/day05/readme.md
index e69de29..d8c68e8 100644
--- a/one_exercise_per_file/week02/day05/readme.md
+++ b/one_exercise_per_file/week02/day05/readme.md
@@ -0,0 +1,26 @@
+# D05 Piscine AI - Data Science
+
+# Table of Contents:
+
+# Introduction
+
+If you finished yesterday's exercises you should be able to train several Machine Learning algorithms and to choose the one returned by GridSearchCV.
+GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the **cv** parameter to compute the GridSearch with a train set and a test set.
+
+It means that the selected model is based on one single measure. What if, by luck, we predict correctly on that section? What if the best model is bad? What if I could have selected a better model?
+
+We will answer these questions today! The topics we will cover are among the most important in Machine Learning.
+
+Must read before starting the exercises:
+
+- Bias-Variance trade-off, aka Underfitting/Overfitting:
+  - https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
+
+  - https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html
+
+- Cross-validation
+  - https://algotrading101.com/learn/train-test-split/
+
+## Rules
+
+## Resources
diff --git a/one_exercise_per_file/week02/raid02/audit/readme.md b/one_exercise_per_file/week02/raid02/audit/readme.md
index e69de29..a7ccc64 100644
--- a/one_exercise_per_file/week02/raid02/audit/readme.md
+++ b/one_exercise_per_file/week02/raid02/audit/readme.md
@@ -0,0 +1,111 @@
+# Forest Cover Type Prediction - Correction
+
+The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features and train a machine learning model on the cartographic data to make it as accurate as possible.
+ + + + +## Problem + + +The expected structure of the project is: + +``` +project +│ README.md +│ environment.yml +│ +└───data +│ │ train.csv +│ | test.csv (not available first day) +| | covtype.info +│ +└───notebook +│ │ EDA.ipynb +| +|───scripts +| │ preprocessing_feature_engineering.py +| │ model_selection.py +│ | predict.py +│ +└───results + │ confusion_matrix_heatmap.png + │ learning_curve_best_model.png + │ test_predictions.csv + │ best_model.pkl + +``` + +- The readme file contains a description of the project and explains how to run the code from an empty environment. It also gives a summary of the implementation of each python file. The preprocessing which is a key part should be decribed precisely. Finally, it should contain a conclusion that gives the performance of the strategy. + +- The environment has to contain all libraries used and their versions that are necessary to run the code. + +- The notebook is not evaluated. + + +## 1. Preprocessing and features engineering: + + + +## 2. Model selection and predict + +### Data splitting + +The data splitting structure is: + +``` +DATA +└───TRAIN FILE (0) +│ └───── Train (1): +│ | Fold0: +| | Train +| | Validation +| | Fold1: +| | Train +| | Validation +... ... ... +| | +| └───── Test (1) +│ +└─── TEST FILE (0)(available last day) + +``` + +- The train set (0) id divised in a train set (1) and test set (1). The ratio is less than 33%. +- The cross validation splits the train set (1) is at least 5 folds. If the cross validation is stratified that's a good point but it is not a requirement. + +### Gridsearch + +- It contains at least these 5 different models: + - Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression. + +There are many options: +- 5 grid searches on 1 model +- 1 grid search on 5 models +- 1 grid search on a pipeline that contains the preprocessing +- 5 grid searches on a pipeline that contains the preprocessing + +### Training + +- Check that the **target is removed from the X** matrix + +### Results +Run predict.py on the test set, check that: + - Test (last day) accuracy > **0.65**. + +Then, check: +- Train accuracy score < **0.98**. It can be checked on the learning curve. If you are not sure, load the model, load the training set (0), score on the training set (0). +- The confusion matrix is represented as a DataFrame. Example: +![alt text][confusion_matrix] + +[confusion_matrix]: ../images/w2_weekend_confusion_matrix.png "Confusion matrix " + +- The learning curve for the best model is plotted. Example: + +![alt text][logo_learning_curve] + +[logo_learning_curve]: ../images/w2_weekend_learning_curve.png "Learning curve " + +Note: The green line on the plot shows the accuracy on the validation set not on the test set (1) and not on the test set (0). 
+ +- The trained model is saved as a pickle file \ No newline at end of file diff --git a/one_exercise_per_file/week02/raid02/images/w2_weekend_confusion_matrix.png b/one_exercise_per_file/week02/raid02/images/w2_weekend_confusion_matrix.png new file mode 100644 index 0000000..f96b37e Binary files /dev/null and b/one_exercise_per_file/week02/raid02/images/w2_weekend_confusion_matrix.png differ diff --git a/one_exercise_per_file/week02/raid02/images/w2_weekend_learning_curve.png b/one_exercise_per_file/week02/raid02/images/w2_weekend_learning_curve.png new file mode 100644 index 0000000..e25120e Binary files /dev/null and b/one_exercise_per_file/week02/raid02/images/w2_weekend_learning_curve.png differ diff --git a/one_exercise_per_file/week02/raid02/readme.md b/one_exercise_per_file/week02/raid02/readme.md index e69de29..15ef755 100644 --- a/one_exercise_per_file/week02/raid02/readme.md +++ b/one_exercise_per_file/week02/raid02/readme.md @@ -0,0 +1,101 @@ +# Forest Cover Type Prediction + +The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features and to train a machine learning model on the cartographic data to make it as accurate as possible. + +## Data + +The input files are `train.csv`, `test.csv` and `covtype.data`: + +- `train.csv` +- `test.csv` +- `covtype.info` + +The train data set is used to **analyse the data and calibrate the models**. The goal is to get the accuracy as high as possible on the test set. The test set will be available at the end of the last day to prevent from the overfitting of the test set. + +The data is described in `covtype.info`. + +## Structure + +The structure of the project is: + +```console +project +│ README.md +│ environment.yml +│ +└───data +│ │ train.csv +│ | test.csv (not available first day) +| | covtype.info +│ +└───notebook +│ │ EDA.ipynb +| +|───scripts +| │ preprocessing_feature_engineering.py +| │ model_selection.py +│ | predict.py +│ +└───results + │ plots + │ test_predictions.csv + │ best_model.pkl + +``` + +## 1. EDA and feature engineering + +- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook will not be evaluated. + +- *Hint: Examples of interesting features* + + - `Distance to hydrology = sqrt((Horizontal_Distance_To_Hydrology)^2 + (Vertical_Distance_To_Hydrology)^2)` + - `Horizontal_Distance_To_Fire_Points - Horizontal_Distance_To_Roadways` + +## 2. Model Selection + +The model selection approach is a key step because, t should return the best model and guaranty that the results are reproducible on the final test set. The goal of this step is to make sure that the results on the test set are not due to test set overfitting. It implies to split the data set as shown below: + +```console +DATA +└───TRAIN FILE (0) +│ └───── Train (1) +│ | Fold0: +| | Train +| | Validation +| | Fold1: +| | Train +| | Validation +... ... ... +| | +| └───── Test (1) +│ +└─── TEST FILE (0) (available last day) + +``` + +**Rules:** + +- Split train test +- Cross validation: at least 5 folds +- Grid search on at least 5 different models: + - Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression. *Remember that for some model scaling the data is important and for others it doesn't matter.* + +- Train accuracy score < **0.98**. Train set (0). Write the result in the `README.md` +- Test (last day) accuracy > **0.65**. Test set (0). 
Write the result in the `README.md`
+- Display the confusion matrix for the best model in a DataFrame. Specify the index and column names (True label and Predicted label)
+- Plot the learning curve for the best model
+- Save the trained model as a [pickle](https://www.datacamp.com/community/tutorials/pickle-python-tutorial) file
+
+> Advice: As the grid search takes time, I suggest preparing and testing the code first. Once you are confident it works, run the grid search overnight and analyse the results.
+
+**Hint**: The confusion matrix shows the misclassifications class per class. Try to detect if the model badly misclassifies one class as another. Then, do some research on the internet on the two forest cover types, find the differences and create some new features that underline these differences. More generally, the methodology of model building is a cycle with several iterations. More details [here](https://serokell.io/blog/machine-learning-testing)
+
+## 3. Predict (last day)
+
+Once you have selected the best model and you are confident it will perform well on new data, you're ready to predict on the test set:
+
+- Load the trained model
+- Predict on the test set and compute the accuracy
+- Save the predictions in a CSV file
+- Add your score to the `README.md`
\ No newline at end of file
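+
+A minimal sketch of what `predict.py` could look like (the file paths and the `Cover_Type` target column name are assumptions, adapt them to your project):
+
+```python
+import pickle
+
+import pandas as pd
+from sklearn.metrics import accuracy_score
+
+# load the trained model saved during model selection
+with open("results/best_model.pkl", "rb") as f:
+    model = pickle.load(f)
+
+# load the test set and separate the target from the features
+test = pd.read_csv("data/test.csv")
+X_test, y_test = test.drop(columns="Cover_Type"), test["Cover_Type"]
+
+# predict, compute the accuracy and save the predictions
+predictions = model.predict(X_test)
+print("Test accuracy:", accuracy_score(y_test, predictions))
+pd.DataFrame({"prediction": predictions}).to_csv("results/test_predictions.csv", index=False)
+```
+
+If the preprocessing and feature engineering are not included in the saved pipeline, apply them to the test set before predicting.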