
feat(ai-branch): add ai specialist branch

- reformatted subjects/audits according to new conventions
Branch: master
Tiago Collot authored 1 year ago, committed by Tiago Collot
Commit 0e99071906
  1. 155
      subjects/ai/backtesting-on-the-sp500/README.md
  2. 112
      subjects/ai/backtesting-on-the-sp500/audit/README.md
  3. 9032
      subjects/ai/backtesting-on-the-sp500/data/fundamentals.csv
  4. 3771
      subjects/ai/backtesting-on-the-sp500/data/sp500.csv
  5. 3645
      subjects/ai/backtesting-on-the-sp500/data/stock_prices.csv
  6. BIN
      subjects/ai/backtesting-on-the-sp500/images/w1_weekend_plot_pnl.png
  7. 359
      subjects/ai/classification-with-scikit-learn /README.md
  8. 220
      subjects/ai/classification-with-scikit-learn /audit/README.md
  9. 699
      subjects/ai/classification-with-scikit-learn /data/breast-cancer-wisconsin.data
  10. 126
      subjects/ai/classification-with-scikit-learn /data/breast-cancer-wisconsin.names
  11. BIN
      subjects/ai/classification-with-scikit-learn /w2_day2_ex2_q1.png
  12. BIN
      subjects/ai/classification-with-scikit-learn /w2_day2_ex3_q1.png
  13. BIN
      subjects/ai/classification-with-scikit-learn /w2_day2_ex3_q3.png
  14. BIN
      subjects/ai/classification-with-scikit-learn /w2_day2_ex3_q5.png
  15. BIN
      subjects/ai/classification-with-scikit-learn /w2_day2_ex3_q6.png
  16. 117
      subjects/ai/credit-scoring/README.md
  17. 87
      subjects/ai/credit-scoring/audit/README.md
  18. BIN
      subjects/ai/credit-scoring/data_description.png
  19. 46
      subjects/ai/credit-scoring/readme_data.md
  20. 302
      subjects/ai/data-wrangling-with-pandas/README.md
  21. 168
      subjects/ai/data-wrangling-with-pandas/audit/README.md
  22. 152
      subjects/ai/emotions-detector/README.md
  23. 131
      subjects/ai/emotions-detector/audit/README.md
  24. 105
      subjects/ai/forest-cover-type-prediction/README.md
  25. 110
      subjects/ai/forest-cover-type-prediction/audit/README.md
  26. BIN
      subjects/ai/forest-cover-type-prediction/images/w2_weekend_confusion_matrix.png
  27. BIN
      subjects/ai/forest-cover-type-prediction/images/w2_weekend_learning_curve.png
  28. 115
      subjects/ai/kaggle-titanic/README.md
  29. 53
      subjects/ai/kaggle-titanic/audit/README.md
  30. BIN
      subjects/ai/kaggle-titanic/titanic.jpg
  31. 141
      subjects/ai/keras-2/README.md
  32. 178
      subjects/ai/keras-2/audit/README.md
  33. 399
      subjects/ai/keras-2/auto-mpg.csv
  34. 136
      subjects/ai/keras/README.md
  35. 172
      subjects/ai/keras/audit/README.md
  36. 325
      subjects/ai/linear-regression-with-scikit-learn/README.md
  37. 223
      subjects/ai/linear-regression-with-scikit-learn/audit/README.md
  38. BIN
      subjects/ai/linear-regression-with-scikit-learn/w2_day01_linear_regression_video.gif
  39. BIN
      subjects/ai/linear-regression-with-scikit-learn/w2_day1_ex2_q1.png
  40. BIN
      subjects/ai/linear-regression-with-scikit-learn/w2_day1_ex2_q3.png
  41. BIN
      subjects/ai/linear-regression-with-scikit-learn/w2_day1_ex5_q1.png
  42. BIN
      subjects/ai/linear-regression-with-scikit-learn/w2_day1_ex5_q5.png
  43. BIN
      subjects/ai/linear-regression-with-scikit-learn/w2_day1_ex5_q6.png
  44. BIN
      subjects/ai/linear-regression-with-scikit-learn/w2_day1_ex5_q8.png
  45. 366
      subjects/ai/machine-learning-pipeline/README.md
  46. 202
      subjects/ai/machine-learning-pipeline/audit/README.md
  47. 286
      subjects/ai/machine-learning-pipeline/data/breast-cancer.csv
  48. 73
      subjects/ai/machine-learning-pipeline/data/breast_cancer_readme.txt
  49. 259
      subjects/ai/model-selection-methodology/README.md
  50. 131
      subjects/ai/model-selection-methodology/audit/README.md
  51. BIN
      subjects/ai/model-selection-methodology/w2_day5_ex5_q1.png
  52. BIN
      subjects/ai/model-selection-methodology/w2_day5_ex5_q2.png
  53. 162
      subjects/ai/natural-language-processing-with-spacy/README.md
  54. 148
      subjects/ai/natural-language-processing-with-spacy/audit/README.md
  55. 18
      subjects/ai/natural-language-processing-with-spacy/resources/news_amazon.txt
  56. BIN
      subjects/ai/natural-language-processing-with-spacy/w3day05ex1_plot.png
  57. 218
      subjects/ai/natural-language-processing/README.md
  58. 208
      subjects/ai/natural-language-processing/audit/README.md
  59. 274
      subjects/ai/neural-networks/README.md
  60. 75
      subjects/ai/neural-networks/audit/README.md
  61. BIN
      subjects/ai/neural-networks/w3_day1_neural_network.png
  62. BIN
      subjects/ai/neural-networks/w3_day1_neuron.png
  63. 736
      subjects/ai/nlp-scraper/BBC News Test.csv
  64. 1491
      subjects/ai/nlp-scraper/BBC News Train.csv
  65. 176
      subjects/ai/nlp-scraper/README.md
  66. 112
      subjects/ai/nlp-scraper/audit/README.md
  67. 255
      subjects/ai/numpy/README.md
  68. 314
      subjects/ai/numpy/audit/README.md
  69. 10
      subjects/ai/numpy/data/model_forecasts.txt
  70. 1600
      subjects/ai/numpy/data/winequality-red.csv
  71. 72
      subjects/ai/numpy/data/winequality.names
  72. 173
      subjects/ai/pandas/README.md
  73. 230
      subjects/ai/pandas/audit/README.md
  74. 20001
      subjects/ai/pandas/data/Ecommerce_purchases.txt
  75. 151
      subjects/ai/pandas/data/iris.csv
  76. 152
      subjects/ai/pandas/data/iris.data
  77. 212
      subjects/ai/sp500-strategies/README.md
  78. BIN
      subjects/ai/sp500-strategies/Time_series_split.png
  79. 142
      subjects/ai/sp500-strategies/audit/README.md
  80. BIN
      subjects/ai/sp500-strategies/blocking_time_series_split.png
  81. BIN
      subjects/ai/sp500-strategies/metric_plot.png
  82. 172
      subjects/ai/time-series-with-pandas/README.md
  83. 185
      subjects/ai/time-series-with-pandas/audit/README.md
  84. 10120
      subjects/ai/time-series-with-pandas/data/AAPL.csv
  85. 300
      subjects/ai/train-and-evalute-machine-learning-models/README.md
  86. 249
      subjects/ai/train-and-evalute-machine-learning-models/audit/README.md
  87. BIN
      subjects/ai/train-and-evalute-machine-learning-models/w2_day4_ex4_q3.png
  88. 271
      subjects/ai/visualizations/README.md
  89. 186
      subjects/ai/visualizations/audit/README.md
  90. BIN
      subjects/ai/visualizations/w1day03_ex1_plot1.png
  91. BIN
      subjects/ai/visualizations/w1day03_ex2_plot1.png
  92. BIN
      subjects/ai/visualizations/w1day03_ex3_plot1.png
  93. BIN
      subjects/ai/visualizations/w1day03_ex4_plot1.png
  94. BIN
      subjects/ai/visualizations/w1day03_ex5_plot1.png
  95. BIN
      subjects/ai/visualizations/w1day03_ex6_plot1.png
  96. BIN
      subjects/ai/visualizations/w1day03_ex7_plot1.png

155
subjects/ai/backtesting-on-the-sp500/README.md

@@ -0,0 +1,155 @@
# Backtesting on the SP500
## SP500 data preprocessing
The goal of this project is to perform a backtest on the SP500 constituents. The SP500 is an index of the 500 largest companies by market capitalization listed on US stock exchanges.
## Data
The input files are:
- `sp500.csv`: contains the SP500 data. The SP500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States.
- `stock_prices.csv`: contains the close prices for all the companies that have been in the SP500. It contains a lot of missing data. The adjusted close price may be unavailable for three main reasons:
- The company doesn't exist at date d
- The company is not public (not listed)
- Its close price hasn't been reported
- Note: The quality of this data set is not good: some prices are wrong, there are price spikes, and there are price adjustments (share splits, dividend distributions). Price adjustments are normally corrected in the adjusted close, but I'm not providing that data for this project, so that you understand what bad-quality data is and how important it is to detect outliers and missing values. The idea is not to correct the full data set manually, but to correct the main problems.
_Note: The corrections will not fully fix the data; as a result, the outcomes may look abnormal compared to results obtained from cleaned financial data. That's not a problem for this small project!_
## Problem
Once this data is preprocessed, it will be used to generate a signal, that is, for each asset at each date, a metric that indicates whether the asset price will increase the next month. At each date (once a month) we will take the 20 highest metrics and invest 1$ per company. This strategy is called **stock picking**: it consists in picking stocks in an index and trying to outperform the index. Finally, we will compare the performance of our strategy to the benchmark: the SP500.
It is important to understand that the SP500 components change over time. The reason is simple: Facebook entered the SP500 in 2013, meaning that another company had to be removed from the index.
The structure of the project is:
```console
project
│ README.md
│ environment.yml
└───data
│ │ sp500.csv
│ | prices.csv
└───notebook
│ │ analysis.ipynb
|
|───scripts
| │ memory_reducer.py
| │ preprocessing.py
| │ create_signal.py
| | backtester.py
│ | main.py
└───results
│ plots
│ results.txt
│ outliers.txt
```
The project is split into five parts:
## 1. Preliminary
- Create a function that takes as input one CSV data file, optimizes the column types to reduce memory usage, and returns a memory-optimized DataFrame.
- For `float` data, the smallest data type used is `np.float32`.
- These steps may help you to implement the `memory_reducer` (a sketch is given after the list):
1. Iterate over every column
2. Determine if the column is numeric
3. Determine if the column can be represented by an integer
4. Find the min and the max value
5. Determine and apply the smallest datatype that can fit the range of values
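Below is a minimal sketch of such a memory reducer, assuming the whole CSV fits in memory and that `np.float32` is the smallest float type allowed, as stated above. The exact signature expected by the project (for example, taking several paths at once as in the `main.py` sketch) may differ.
```python
import numpy as np
import pandas as pd

def memory_reducer(path):
    """Load a CSV file and downcast numeric columns to reduce memory usage."""
    df = pd.read_csv(path)
    for col in df.columns:
        if np.issubdtype(df[col].dtype, np.number):
            c_min, c_max = df[col].min(), df[col].max()
            if np.issubdtype(df[col].dtype, np.integer):
                # pick the smallest integer type that can hold the value range
                for int_type in (np.int8, np.int16, np.int32, np.int64):
                    if np.iinfo(int_type).min <= c_min and c_max <= np.iinfo(int_type).max:
                        df[col] = df[col].astype(int_type)
                        break
            else:
                # floats are downcast to np.float32 at most
                df[col] = df[col].astype(np.float32)
    return df
```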
## 2. Data wrangling and preprocessing
- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:
- Missing values analysis
- Outliers analysis (there are a lot of outliers)
- A plot of the average price across companies for all variables (save the plot with the other images).
- Describe at least 5 outliers ('ticker', 'date', 'price'). Put them in an `outliers.txt` file with these 3 fields, in the `results` folder.
_Note: create functions that generate the plots and save them in the images folder. Add a parameter `plot` with a default value of `False` so that the function doesn't return the plot by default. This will be useful during the correction: it lets people run your code without overwriting your plots._
- Here is how the `prices` data should be preprocessed:
- Resample the data monthly and keep the last value
- Filter price outliers: remove prices outside the range [0.1$, 10k$]
- Compute monthly returns (a pandas sketch is given at the end of this section):
- Historical returns. **returns(current month) = (price(current month) - price(previous month)) / price(previous month)**
- Future returns. **returns(current month) = (price(next month) - price(current month)) / price(current month)**
- Replace return outliers by the last available value for the same company. This corrects price spikes that correspond to a monthly return greater than 1 or smaller than -0.5. This correction should not be applied to the 2008 and 2009 period, as the financial crisis impacted the market brutally. **Don't forget that a value is considered an outlier by comparison with the other returns/prices of the same company.**
At this stage the DataFrame should look like this:
| | Price | monthly_past_return | monthly_future_return |
| :--------------------------------------------------- | ------: | ------------------: | -------------------: |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'A') | 36.7304 | nan | -0.00365297 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AA') | 25.9505 | nan | 0.101194 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AAPL') | 1.00646 | nan | 0.452957 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABC') | 11.4383 | nan | -0.0528713 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABT') | 38.7945 | nan | -0.07205 |
- Fill the missing values using the last available value (same company)
- Drop the missing values that can't be filled
- Print `prices.isna().sum()`
- Here is how the `sp500.csv` data should be preprocessed:
- Resample the data monthly and keep the last value
- Compute the historical monthly returns on the adjusted close
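A possible pandas sketch of the monthly resampling and return computations described above, assuming the prices file has columns named `date`, `ticker` and `price` (the actual column names may differ):
```python
def compute_monthly_returns(prices):
    """Resample to month-end per company, then compute past and future monthly returns."""
    monthly = (
        prices.sort_values("date")
              .set_index("date")
              .groupby("ticker")["price"]
              .resample("M")
              .last()
              .to_frame("Price")
    )
    grouped = monthly.groupby(level="ticker")["Price"]
    monthly["monthly_past_return"] = grouped.pct_change()
    monthly["monthly_future_return"] = grouped.shift(-1) / monthly["Price"] - 1
    # the resulting index is (ticker, date); swap the levels if you prefer (date, ticker)
    return monthly
```
Outlier filtering and the per-company forward fill described above can then be applied to this frame.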
## 3. Create signal
At this stage we have a data set with features that we will leverage to get an investment signal. As previously said, we will focus on one single variable to create the signal: **monthly_past_return**. The signal will be the average of the monthly returns over the previous year.
The naive assumption made here is that if a stock has performed well over the last year, it will perform well the next month. Moreover, we assume that we can buy a stock as soon as we have the signal (the signal is available at the close of day `d` and we assume that we can buy the stock at the close of day `d`; this assumption is acceptable when considering monthly returns, because the difference between the close of day `d` and the open of day `d+1` is small compared to the monthly return).
- Create a column `average_return_1y`
- Create a column named `signal` that contains `True` if `average_return_1y` is among the 20 highest values of `average_return_1y` within that month (see the sketch below).
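A sketch of the signal creation, assuming the DataFrame has a MultiIndex with levels named `date` and `ticker` and the `monthly_past_return` column computed earlier (these names are assumptions):
```python
def create_signal(prices):
    """Add the 1-year average past return and the monthly top-20 signal."""
    # average past return over the previous 12 months, computed per company
    prices["average_return_1y"] = (
        prices.groupby(level="ticker")["monthly_past_return"]
              .transform(lambda s: s.rolling(12).mean())
    )
    # rank within each month: rank 1 is the highest average return
    rank_in_month = prices.groupby(level="date")["average_return_1y"].rank(ascending=False)
    prices["signal"] = rank_in_month <= 20
    return prices
```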
## 4. Backtester
At this stage we have an investment signal that indicates, each month, the 20 companies we should invest 1$ in (1$ each). In order to check the strategy and its performance, we will backtest our investment signal.
- Compute the PnL and the total return of our strategy without a for loop. Save the results in a text file `results.txt` in the folder `results`.
- Compute the PnL and the total return of the strategy that consists in investing 20$ each month in the SP500. Compare. Save the results in a text file `results.txt` in the folder `results`.
- Create a plot that shows the performance of the strategy over time for the SP500 and the Stock Picking 20 strategy.
A data point (x-axis: date, y-axis: cumulated_return) is the **cumulative return** from the beginning of the strategy up to date `t`. Save the plot in the results folder.
> This plot is used a lot in finance because it helps to compare a custom strategy with an index. In that case we say that the SP500 is used as a **benchmark** for the Stock Picking strategy.
![alt text][performance]
[performance]: images/w1_weekend_plot_pnl.png 'Cumulative Performance'
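A minimal vectorized sketch of this backtest, assuming the index level is named `date` and the column names used earlier; the plotting part is left out:
```python
def backtest(prices, sp500):
    """Compute the PnL and total return of the stock-picking strategy and of the benchmark."""
    # 1$ on each signaled stock, paid back by its future monthly return
    strategy_pnl = (prices["signal"] * prices["monthly_future_return"]).groupby(level="date").sum()
    strategy_total_return = strategy_pnl.sum() / prices["signal"].sum()

    # benchmark: 20$ invested in the SP500 every month
    benchmark_pnl = 20 * sp500["monthly_past_return"]

    with open("results/results.txt", "w") as f:
        f.write(f"Strategy PnL: {strategy_pnl.sum():.2f}\n")
        f.write(f"Strategy total return: {strategy_total_return:.4f}\n")
        f.write(f"Benchmark PnL: {benchmark_pnl.sum():.2f}\n")
    # the cumulative performance plot can be built from strategy_pnl.cumsum()
    # and benchmark_pnl.cumsum()
```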
## 5. Main
Here is a sketch of `main.py`.
```python
# main.py
# import data
prices, sp500 = memory_reducer(paths)
# preprocessing
prices, sp500 = preprocessing(prices, sp500)
# create signal
prices = create_signal(prices)
#backtest
backtest(prices, sp500)
```
**The command `python main.py` executes the code from data imports to the backtest and saves the results.**

112
subjects/ai/backtesting-on-the-sp500/audit/README.md

@@ -0,0 +1,112 @@
# Backtesting on the SP500 - audit
### Preliminary
###### Is the structure of the project as below?
```
project
│ README.md
│ environment.yml
└───data
│ │ sp500.csv
│ | prices.csv
└───notebook
│ │ analysis.ipynb
|
|───scripts
| │ memory_reducer.py
| │ preprocessing.py
| │ create_signal.py
| | backtester.py
│ | main.py
└───results
│ plots
│ results.txt
│ outliers.txt
```
###### Does the README file contain a description of the project, explain how to run the code from an empty environment, give a summary of the implementation of each Python file, and contain a conclusion that gives the performance of the strategy?
###### Does the environment contain all libraries used and their versions that are necessary to run the code ?
###### Does the notebook contain a missing values analysis? **Example**: number of missing values per variable or per year
###### Does the notebook contain an outliers analysis?
###### Does the notebook contain a histogram of the average price across companies for all variables (with the plot saved in the images folder)? This is required only for the `prices.csv` data.
###### Does the notebook describe at least 5 outliers ('ticker', 'date', 'price')? Checking the outliers is simple: search the historical stock price on Google at the given date and compare. The price may fluctuate a bit. The goal here is not to match the historical price found on Google but to detect a huge difference between the price in our data and the real historical one.
Notes:
- For all questions, always check that the values are sorted by date. If not, the answers are wrong.
- The plots are validated only if they contain a title
### Python files
#### 1. memory_reducer.py
###### Does the `prices` data set weigh less than **8 MB** (megabytes)?
###### Does the `sp500` data set weigh less than **0.15 MB** (megabytes)?
###### Are the data types at least `np.float32`? A smaller data type may alter the precision of the data.
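A quick way to check these thresholds during the audit is to look at the deep memory usage of the DataFrames returned by the peer's `memory_reducer` (a sketch; the loading lines are only indicative):
```python
import pandas as pd

def memory_mb(df: pd.DataFrame) -> float:
    """Return the deep memory usage of a DataFrame in megabytes."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2

# prices, sp500 = memory_reducer(paths)         # peer's function
# print(f"prices: {memory_mb(prices):.2f} MB")  # expected < 8 MB
# print(f"sp500:  {memory_mb(sp500):.2f} MB")   # expected < 0.15 MB
```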
#### 2. preprocessing.py
##### The data is aggregated over a monthly period and only the last element is kept
##### The outliers are filtered out by removing all prices bigger than 10k$ and smaller than 0.1$
##### The historical return is computed using only current and past values.
##### The future return is computed using only the current and future values. (Reminder: as the data is resampled monthly, computing the return is straightforward)
##### The outliers in the returns data are set to NaN for all returns not in the years 2008 and 2009. The filters are: return > 1 or return < -0.5.
##### The missing values are filled using the last value available **for the company**. `df.fillna(method='ffill')` is wrong because the previous value can be the return or price of another company.
##### The missing values that can't be filled using the previous existing value are dropped.
##### The number of missing values is 0
Best practice:
Do not forward fill the last values of the future return: they are missing because the data set ends at a given date, so filling them doesn't make sense. It makes more sense to drop these rows because the backtest focuses on observed data.
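A small, self-contained sketch of the difference between a global forward fill and the per-company forward fill required here (the tickers and prices are made up):
```python
import numpy as np
import pandas as pd

# toy frame: two companies over two months, AAPL's February price is missing
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-31", "2020-02-29"]), ["AAPL", "MSFT"]],
    names=["date", "ticker"],
)
prices = pd.DataFrame({"Price": [300.0, 150.0, np.nan, 155.0]}, index=idx)

wrong = prices.fillna(method="ffill")            # propagates MSFT's price to AAPL
right = prices.groupby(level="ticker").ffill()   # stays within the same company

print(wrong.loc[(pd.Timestamp("2020-02-29"), "AAPL"), "Price"])  # 150.0 (wrong)
print(right.loc[(pd.Timestamp("2020-02-29"), "AAPL"), "Price"])  # 300.0 (correct)
```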
#### 3. create_signal.py
##### The metric `average_return_1y` is added as a new column of the merged DataFrame. The metric is relative to a company: it is important to group the data by company first before computing the average return over 1 year. It is accepted to consider that one year is 12 consecutive rows.
##### The signal is added as a new column to the merged DataFrame. The signal, which is boolean, indicates whether, within the same month, the company is in the top 20. The top 20 corresponds to the 20 companies with the highest metric within the same month. The highest metric gets rank 1 (if `rank` is used, the parameter `ascending` should be set to `False`).
#### 4. backtester.py
##### The PnL is computed by multiplying the signal `Series` by the **future returns**.
##### The return of the strategy is computed by dividing the PnL by the sum of the signal `Series`.
##### The signal used on the SP500 is the `pd.Series([20,20,...,20])`
##### The series used in the plot are the cumulative PnL. `cumsum` can be used
##### The PnL on the full historical data is **smaller than 75$**. If not, it means that the outliers were not corrected correctly.
###### Does the plot contain a title ?
###### Does the plot contain a legend ?
###### Does the plot contain a x-axis and y-axis name ?
![alt text][performance]
[performance]: ../images/w1_weekend_plot_pnl.png "Cumulative Performance"
#### 5. main.py
###### Does the command `python main.py` execute the code from data imports to the backtest and save the results? It shouldn't return any error for the project to be validated.

9032
subjects/ai/backtesting-on-the-sp500/data/fundamentals.csv

File diff suppressed because it is too large.

3771
subjects/ai/backtesting-on-the-sp500/data/sp500.csv

File diff suppressed because it is too large.

3645
subjects/ai/backtesting-on-the-sp500/data/stock_prices.csv

File diff suppressed because one or more lines are too long.

BIN
subjects/ai/backtesting-on-the-sp500/images/w1_weekend_plot_pnl.png

Binary file not shown.

Size: 136 KiB

359
subjects/ai/classification-with-scikit-learn /README.md

@@ -0,0 +1,359 @@
# Classification with Scikit Learn
The goal of this day is to understand practical classification.
Today we will learn a different approach in Machine Learning: classification, which is a large domain in the field of statistics and machine learning. Generally, it can be broken down into two areas:
- **Binary classification**, where we wish to group an outcome into one of two groups.
- **Multi-class classification**, where we wish to group an outcome into one of multiple (more than two) groups.
You may wonder why the approach is different from regression, and why we don't simply use regression and define a threshold above which the class would be 1, and 0 otherwise - in binary classification.
The main reason is that linear regression is sensitive to outliers, hence the threshold would vary depending on the outliers in the data. The article mentioned in the resources explains this with plots. To keep things simple, we can say that the output needed in classification is the probability of belonging to one of the classes. So, by definition, the value output by the classification model has to be between 0 and 1. Linear regression can't satisfy this constraint.
In mathematics, there are functions with nice properties that take as input a real number in (-inf, inf) and output a value between 0 and 1; the most popular of them is the **sigmoid**, which is the inverse of the logit function, hence the name logistic regression.
Let's take a small example to better understand the steps needed to perform a logistic regression on binary data. Let's assume that we want to predict the gender given a person's height.
Logistic regression steps:
- Fit a sigmoid on the training data
- Compute sigmoid(size)=0.7 because the sigmoid returns values between 0 and 1
- Return the class: 0.7 > 0.5 => class 1. Thus, the gender is male
For the linear regression exercises, the loss (Mean Squared Error - MSE) is minimized with an algorithm called **gradient descent**. In classification, the MSE loss can't be used because the output of the model is 0 or 1 (for binary classification).
The **logloss** or **cross entropy** is the loss used for classification. Similarly, it has some nice mathematical properties. The minimization of the **logloss** is not covered in the exercises. However, since it is used in most machine learning models for classification, I recommend spending some time reading the related article.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Logistic regression with Scikit-learn
- Exercise 2: Sigmoid
- Exercise 3: Decision boundary
- Exercise 4: Train test split
- Exercise 5: Breast Cancer prediction
- Exercise 6: Multi-class (**Optional**)
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
_Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the most recent one. Scikit Learn 1.0 is finally available after... 14 years.
### **Resources**
### Logistic regression
- https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102
### Logloss
- https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
- https://medium.com/swlh/what-is-logistic-regression-62807de62efa
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
---
---
# Exercise 1: Logistic regression in Scikit-learn
The goal of this exercise is to learn to use Scikit-learn to classify data.
```python
X = [[0],[0.1],[0.2], [1],[1.1],[1.2], [1.3]]
y = [0,0,0,1,1,1,0]
```
1. Predict the class for `x_pred = [[0.5]]`.
2. Predict the probabilities for `x_pred = [[0.5]]` using `predict_proba`.
3. Print the coefficients (`coef_`), the intercept (`intercept_`) and the score of the logistic regression of X and y.
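A minimal sketch covering the three questions above (the variable names are only suggestions):
```python
from sklearn.linear_model import LogisticRegression

X = [[0], [0.1], [0.2], [1], [1.1], [1.2], [1.3]]
y = [0, 0, 0, 1, 1, 1, 0]

clf = LogisticRegression()
clf.fit(X, y)

x_pred = [[0.5]]
print(clf.predict(x_pred))          # predicted class
print(clf.predict_proba(x_pred))    # probabilities for each class
print(clf.coef_, clf.intercept_)    # fitted coefficient and intercept
print(clf.score(X, y))              # mean accuracy on X, y
```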
---
---
# Exercise 2: Sigmoid
The goal of this exercise is to learn to compute and plot the sigmoid function.
1. On the same plot, plot the sigmoid function and the custom sigmoids defined as:
- `sigmoid1(x) = 1/(1+ exp(-(0.5*x + 3)))`
- `sigmoid2(x) = 1/(1+ exp(-(5*x + 11)))`
- Add a line representing the probability=0.5
The plot should look like this:
![alt text][ex2q1]
[ex2q1]: ./w2_day2_ex2_q1.png "Scatter plot"
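A possible sketch of the plot using `matplotlib` (the exact styling of the expected figure may differ):
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.linspace(-20, 20, 500)
plt.plot(x, sigmoid(x), label="sigmoid(x)")
plt.plot(x, sigmoid(0.5 * x + 3), label="sigmoid1(x)")
plt.plot(x, sigmoid(5 * x + 11), label="sigmoid2(x)")
plt.axhline(y=0.5, color="grey", linestyle="--", label="probability = 0.5")
plt.title("Sigmoid functions")
plt.legend()
plt.show()
```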
---
---
# Exercise 3: Decision boundary
The goal of this exercise is to learn to fit a logistic regression on simple examples and to understand how the algorithm separates the data of the different classes.
## 1 dimension
First, we will start as usual with feature data in 1 dimension. Use `make_classification` from Scikit-learn to generate 100 data points:
```python
X,y = make_classification(
n_samples=100,
n_features=1,
n_informative=1,
n_redundant=0,
n_repeated=0,
n_classes=2,
n_clusters_per_class=1,
weights=[0.5,0.5],
flip_y=0.15,
class_sep=2.0,
hypercube=True,
shift=1.0,
scale=1.0,
shuffle=True,
random_state=88
)
```
_Warning: The shape of X is not the same as the shape of y. You may need (for some questions) to reshape X using: `X.reshape(1,-1)[0]`._
1. Plot the data using a scatter plot. The x-axis contains the feature and y-axis contains the target.
The plot should look like this:
![alt text][ex3q1]
[ex3q1]: ./w2_day2_ex3_q1.png "Scatter plot"
2. Fit a Logistic Regression on the generated data using scikit learn. Print the coefficients and the interception of the Logistic Regression.
3. Add to the previous plot the fitted sigmoid and the 0.5 probability line. The plot should look like this:
![alt text][ex3q3]
[ex3q3]: ./w2_day2_ex3_q3.png "Scatter plot + Logistic regression"
4. Create a function `predict_probability` that takes as input the data point and the coefficients and that returns the predicted probability. As a reminder, the probability is given by: `p(x) = 1/(1+ exp(-(coef*x + intercept)))`. Check you have the same results as the method `predict_proba` from Scikit-learn.
```python
def predict_probability(coefs, X):
    '''
    coefs is a list that contains a and b: [coef, intercept]
    X is the features set
    Returns the probability of X
    '''
    # TODO
    probabilities = ...
    return probabilities
```
5. Create a function `predict_class` that takes as input the data point and the coefficients and that returns the predicted class. Check you have the same results as the class method `predict` output on the same data.
6. On the plot add the predicted class. The plot should look like this (the predicted class is shifted a bit to make the plot more understandable, but obviously the predicted class is 0 or 1, not 0.1 or 0.9)
The plot should look like this:
![alt text][ex3q6]
[ex3q6]: ./w2_day2_ex3_q5.png "Scatter plot + Logistic regression + predictions"
## 2 dimensions
Now, let us repeat this process on 2-dimensional data. The goal is to focus on the decision boundary and to understand how the Logistic Regression creates a line that separates the data. The code to plot the decision boundary is provided; however, it is important to understand how it works.
- Generate the data points using:
```python
X, y = make_classification(n_features=2,
n_redundant=0,
n_samples=250,
n_classes=2,
n_clusters_per_class=1,
flip_y=0.05,
class_sep=3,
random_state=43)
```
7. Fit the Logistic Regression on X and y and use the code below to plot the fitted sigmoid on the data set.
The plot should look like this:
![alt text][ex3q7]
[ex3q7]: ./w2_day2_ex3_q6.png "Logistic regression decision boundary"
```python
import numpy as np
import matplotlib.pyplot as plt

# assumes X, y generated above and clf, a LogisticRegression already fitted on them
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
#if needed change the line below
probs = clf.predict_proba(grid)[:, 1].reshape(xx.shape)
f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu",
vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])
ax.scatter(X[:,0], X[:, 1], c=y, s=50,
cmap="RdBu", vmin=-.2, vmax=1.2,
edgecolor="white", linewidth=1)
ax.set(aspect="equal",
xlim=(-5, 5), ylim=(-5, 5),
xlabel="$X_1$", ylabel="$X_2$")
```
Resource used for plotting the decision boundary:
- https://stackoverflow.com/questions/28256058/plotting-decision-boundary-of-logistic-regression
---
---
# Exercise 4: Train test split
The goal of this exercise is to learn to split a classification data set. The idea is the same as splitting a regression data set but there's one important detail specific to the classification: the proportion of each class in the train set and test set.
```python
X = np.arange(1,21).reshape(10,-1)
y = np.zeros(10)
y[7:] = 1
```
1. Split the data using `train_test_split` with `shuffle=False`. The test set represents 20% of the total size of the data set. Print X_train, y_train, X_test, y_test. Compute the proportion of class `1` on the train set and test set.
2. Having a train set with different properties than the test set is not recommended. The analogy of the exam (https://www.youtube.com/watch?v=_vdMKioCXqQ) helps to understand this point: if the questions in the exam are completely different from what you prepared for, you are not evaluated on what you learned. The training set has to be representative of the data set. Now, split the data into a train set and a test set, but keep the proportion of class `1` nearly constant. The parameter `shuffle` alone does not guarantee this, as it relies on random sampling. The parameter `stratify` will split the data while keeping the same proportion of class `1` in the train set and the test set. Using the parameter `stratify`, split the data below and print the proportion of class `1` in the train set and the test set.
```python
X = np.arange(1,201).reshape(100,-1)
y = np.zeros(100)
y[70:] = 1
```
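A minimal sketch of the stratified split for the data above; `random_state=0` is an arbitrary choice, not part of the subject:
```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 201).reshape(100, -1)
y = np.zeros(100)
y[70:] = 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean())  # proportion of class 1 in the train set (~0.3)
print(y_test.mean())   # proportion of class 1 in the test set (~0.3)
```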
---
---
# Exercise 5: Breast Cancer prediction
The goal of this exercise is to use Logistic Regression
to predict breast cancer. It is always important to understand the data before training any Machine Learning algorithm. The data is described in **breast-cancer-wisconsin.names**. I suggest adding the column names to the DataFrame manually.
Preliminary:
- If needed, replace missing values with the median of the column.
- Handle the column `Sample code number`. This column won't be used to train the model as it doesn't contain information on breast cancer. There are two solutions: drop it or set it as index.
1. Print the proportion of class `Benign`. What would be the accuracy if the model always predicts `Benign`?
Later this week we will learn about other metrics, such as AUC, that will help us tackle highly imbalanced data sets.
2. Using train_test_split, split the data set in a train set and test set (20%). Both sets should have approximately the same proportion of class `Benign`. Use `random_state = 43`.
3. Fit the logistic regression on the train set. Predict on the train set and test set. Compute the score on the train set and test set. 92-97% accuracy is expected on the test set.
4. Compute the confusion matrix on both sets. Analyse the number of false negatives and false positives.
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
- https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
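A possible sketch of the preliminary steps, assuming the file is read from `data/breast-cancer-wisconsin.data` and using the column names listed in `breast-cancer-wisconsin.names`; missing values are encoded as `?` in this file:
```python
import pandas as pd

# column names taken from breast-cancer-wisconsin.names
columns = [
    "Sample code number", "Clump Thickness", "Uniformity of Cell Size",
    "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size",
    "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class",
]
df = pd.read_csv(
    "data/breast-cancer-wisconsin.data",
    names=columns,
    na_values="?",                    # missing values are encoded as '?'
    index_col="Sample code number",   # or drop the column instead
)
df = df.fillna(df.median())
print((df["Class"] == 2).mean())      # proportion of class Benign (encoded as 2)
```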
---
---
# Exercise 6: Multi-class (Optional)
The goal of this exercise is to learn to train a classification algorithm on multi-class labelled data.
Some algorithms, such as SVM or Logistic Regression, do not natively support multi-class classification (more than 2 classes). There are approaches that allow these algorithms to be used on multi-class data.
Let's assume we work with 3 classes: A, B and C.
- One-vs-Rest considers 3 binary classification problems: A vs B,C; B vs A,C and C vs A,B. If there are 10 classes, 10 binary classification problems would be fitted.
- One-vs-One considers 3 binary classification problems: A vs B, A vs C, B vs C. If there are 10 classes, 45 binary classification problems would be fitted. Given the volume of data, this technique may not be scalable.
More details:
- https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/
Let's implement the One-vs-Rest approach from `LogisticRegression`.
Preliminary:
- Import the iris data set from Scikit-learn
```python
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
X = pd.DataFrame(data=iris['data'], columns=iris.feature_names)
y = pd.DataFrame(data=iris['target'], columns=['target'])
```
- Using train_test_split, split the data set in a train set and test set (20%) with `shuffle=True` and `random_state=43`.
1. Create a function that takes as input the data and returns three **trained** classifiers.
- `clf0` takes as input a binary data set where the class 1 is `0` and class 0 is `1` and `2`.
- `clf1` takes as input a binary data set where the class 1 is `1` and class 0 is `0` and `2`.
- `clf2` takes as input a binary data set where the class 1 is `2` and class 0 is `0` and `1`.
```python
def train(X_train, y_train):
    # TODO
    return clf0, clf1, clf2
```
2. Create a function that takes as input the trained classifiers and the features set and that returns the predicted class. Use `predict_one_vs_all` to output the predicted classes on the test set. Compare the results with the Logistic Regression algorithm from scikit-learn used in one-vs-rest mode. The results may differ because the solver may not converge. Later this week, we will learn to preprocess the data to avoid convergence issues.
- `clf0` outputs the probability to belong to the class 1 which is `0`.
- `clf1` outputs the probability to belong to the class 1 which is `1`.
- `clf2` outputs the probability to belong to the class 1 which is `2`.
The predicted class is the one that gets the **highest probability** among the three models.
```python
def predict_one_vs_all(X, clf0, clf1, clf2):
    # TODO
    return classes
```
- https://randerson112358.medium.com/python-logistic-regression-program-5e1b32f964db
- https://towardsdatascience.com/logistic-regression-using-python-sklearn-numpy-mnist-handwriting-recognition-matplotlib-a6b31e2b166a
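One possible sketch of `predict_one_vs_all`, assuming each classifier was trained as in question 1, so that its positive class is the class it detects:
```python
import numpy as np

def predict_one_vs_all(X, clf0, clf1, clf2):
    """Return the predicted class (0, 1 or 2) for each row of X."""
    # each clf_i outputs the probability of belonging to class i in its second column
    probas = np.column_stack([
        clf0.predict_proba(X)[:, 1],
        clf1.predict_proba(X)[:, 1],
        clf2.predict_proba(X)[:, 1],
    ])
    # the predicted class is the one with the highest probability
    return probas.argmax(axis=1)
```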

220
subjects/ai/classification-with-scikit-learn /audit/README.md

@@ -0,0 +1,220 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
###### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error ?
---
---
#### Exercise 1: Logistic regression with Scikit-learn
##### The question 1 is validated if the predicted class is `0`.
##### The question 2 is validated if the predicted probabilities are `[0.61450526 0.38549474]`
##### The question 3 is validated if the output is:
```console
Coefficient:
[[0.81786797]]
Intercept:
[-0.87522391]
Score:
0.7142857142857143
```
---
---
#### Exercise 2: Sigmoid
##### The question 1 is validated if the plot looks like this:
![alt text][ex2q1]
[ex2q1]: ../w2_day2_ex2_q1.png "Scatter plot"
---
---
#### Exercise 3: Decision boundary
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the outputted plot looks like this:
![alt text][ex3q1]
[ex3q1]: ../w2_day2_ex3_q1.png "Scatter plot"
##### The question 2 is validated if the coefficient and the intercept of the Logistic Regression are:
```console
Intercept: [-0.98385574]
Coefficient: [[1.18866075]]
```
##### The question 3 is validated if the plot looks like this:
![alt text][ex3q2]
[ex3q2]: ../w2_day2_ex3_q3.png "Scatter plot"
##### The question 4 is validated if `predict_probability` outputs the same probabilities as `predict_proba`. Note that the values have to match one of the class probabilities, not both. To do so, compare the output with: `clf.predict_proba(X)[:,1]`. The shape of the arrays is not important.
##### The question 5 is validated if `predict_class` outputs the same classes as `clf.predict(X)`. The shape of the arrays is not important.
##### The question 6 is validated if the plot looks like the plot below. As mentioned, it is not required to shift the class prediction to make the plot easier to understand.
![alt text][ex3q6]
[ex3q6]: ../w2_day2_ex3_q5.png "Scatter plot + Logistic regression + predictions"
##### The question 7 is validated if the plot looks like this:
![alt text][ex3q7]
[ex3q7]: ../w2_day2_ex3_q6.png "Logistic regression decision boundary"
---
---
#### Exercise 4: Train test split
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if X_train, y_train, X_test, y_test match the output below. The proportion of class `1` is **0.125** in the train set and **1.** in the test set.
```console
X_train:
[[ 1 2]
[ 3 4]
[ 5 6]
[ 7 8]
[ 9 10]
[11 12]
[13 14]
[15 16]]
y_train:
[0. 0. 0. 0. 0. 0. 0. 1.]
X_test:
[[17 18]
[19 20]]
y_test:
[1. 1.]
```
##### The question 2 is validated if the proportion of class `1` is **0.3** for both sets.
---
---
#### Exercise 5: Breast Cancer prediction
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the proportion of class `Benign` is 0.6552217453505007. It means that if you always predict `Benign` your accuracy would be 66%.
##### The question 2 is validated if the proportion of one of the classes is approximately the same on the train and test sets: ~0.65. In my case:
- test: 0.6571428571428571
- train: 0.6547406082289803
##### The question 3 is validated if the output is:
```console
# Train
Class prediction on train set:
[4 2 4 2 2 2 2 4 2 2]
Probability prediction on train set:
[0.99600415 0.00908666 0.99992744 0.00528803 0.02097154 0.00582772
0.03565076 0.99515326 0.00788281 0.01065484]
Score on train set:
0.9695885509838998
#Test
Class prediction on test set:
[2 2 2 4 2 4 2 2 2 4]
Probability prediction on test set:
[0.01747203 0.22495309 0.00698756 0.54020801 0.0015289 0.99862249
0.33607994 0.01227679 0.00438157 0.99972344]
Score on test set:
0.9642857142857143
```
Only the first 10 predictions are shown. The score is computed on the full train and test sets.
For various reasons, your data split may differ from mine. The requirement for this question is to have a score on the test set greater than 92%.
If the score is 1, congratulate your peer: they have just leaked their target. The target should be dropped from X_train or X_test ;)!
##### The question 4 is validated if the confusion matrix on the train set is similar to:
```console
array([[357, 9],
[ 8, 185]])
```
and if the confusion matrix on the test set is similar to:
```console
array([[90, 2],
[ 3, 45]])
```
As said, the results may be slightly different from mine because of the data split. However, the values in the confusion matrix should be close to these.
---
---
#### Exercise 6: Multi-class (Optional)
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if each classifier has as input a binary data as below:
```python
def train(X_train, y_train):
    clf = LogisticRegression()
    clf1 = LogisticRegression()
    clf2 = LogisticRegression()
    clf.fit(X_train, y_train == 0)
    clf1.fit(X_train, y_train == 1)
    clf2.fit(X_train, y_train == 2)
    return clf, clf1, clf2
```
##### The question 2 is validated if the predicted classes on the test set are:
```console
array([0, 0, 2, 1, 2, 0, 2, 1, 1, 1, 0, 1, 2, 0, 1, 1, 0, 0, 2, 2, 0, 0,
0, 2, 2, 2, 0, 1, 0, 0])
```
Even though I had the warning `ConvergenceWarning: lbfgs failed to converge (status=1)`, I noticed that `LogisticRegression` returned the same output.

699
subjects/ai/classification-with-scikit-learn /data/breast-cancer-wisconsin.data

@@ -0,0 +1,699 @@
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2
1035283,1,1,1,1,1,1,3,1,1,2
1036172,2,1,1,1,2,1,2,1,1,2
1041801,5,3,3,3,2,3,4,4,1,4
1043999,1,1,1,1,2,3,3,1,1,2
1044572,8,7,5,10,7,9,5,5,4,4
1047630,7,4,6,4,6,1,4,3,1,4
1048672,4,1,1,1,2,1,2,1,1,2
1049815,4,1,1,1,2,1,3,1,1,2
1050670,10,7,7,6,4,10,4,1,2,4
1050718,6,1,1,1,2,1,3,1,1,2
1054590,7,3,2,10,5,10,5,4,4,4
1054593,10,5,5,3,6,7,7,10,1,4
1056784,3,1,1,1,2,1,2,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1059552,1,1,1,1,2,1,3,1,1,2
1065726,5,2,3,4,2,7,3,6,1,4
1066373,3,2,1,1,1,1,2,1,1,2
1066979,5,1,1,1,2,1,2,1,1,2
1067444,2,1,1,1,2,1,2,1,1,2
1070935,1,1,3,1,2,1,1,1,1,2
1070935,3,1,1,1,1,1,2,1,1,2
1071760,2,1,1,1,2,1,3,1,1,2
1072179,10,7,7,3,8,5,7,4,3,4
1074610,2,1,1,2,2,1,3,1,1,2
1075123,3,1,2,1,2,1,2,1,1,2
1079304,2,1,1,1,2,1,2,1,1,2
1080185,10,10,10,8,6,1,8,9,1,4
1081791,6,2,1,1,1,1,7,1,1,2
1084584,5,4,4,9,2,10,5,6,1,4
1091262,2,5,3,3,6,7,7,5,1,4
1096800,6,6,6,9,6,?,7,8,1,2
1099510,10,4,3,1,3,3,6,5,2,4
1100524,6,10,10,2,8,10,7,3,3,4
1102573,5,6,5,6,10,1,3,1,1,4
1103608,10,10,10,4,8,1,8,10,1,4
1103722,1,1,1,1,2,1,2,1,2,2
1105257,3,7,7,4,4,9,4,8,1,4
1105524,1,1,1,1,2,1,2,1,1,2
1106095,4,1,1,3,2,1,3,1,1,2
1106829,7,8,7,2,4,8,3,8,2,4
1108370,9,5,8,1,2,3,2,1,5,4
1108449,5,3,3,4,2,4,3,4,1,4
1110102,10,3,6,2,3,5,4,10,2,4
1110503,5,5,5,8,10,8,7,3,7,4
1110524,10,5,5,6,8,8,7,1,1,4
1111249,10,6,6,3,4,5,3,6,1,4
1112209,8,10,10,1,3,6,3,9,1,4
1113038,8,2,4,1,5,1,5,4,4,4
1113483,5,2,3,1,6,10,5,1,1,4
1113906,9,5,5,2,2,2,5,1,1,4
1115282,5,3,5,5,3,3,4,10,1,4
1115293,1,1,1,1,2,2,2,1,1,2
1116116,9,10,10,1,10,8,3,3,1,4
1116132,6,3,4,1,5,2,3,9,1,4
1116192,1,1,1,1,2,1,2,1,1,2
1116998,10,4,2,1,3,2,4,3,10,4
1117152,4,1,1,1,2,1,3,1,1,2
1118039,5,3,4,1,8,10,4,9,1,4
1120559,8,3,8,3,4,9,8,9,8,4
1121732,1,1,1,1,2,1,3,2,1,2
1121919,5,1,3,1,2,1,2,1,1,2
1123061,6,10,2,8,10,2,7,8,10,4
1124651,1,3,3,2,2,1,7,2,1,2
1125035,9,4,5,10,6,10,4,8,1,4
1126417,10,6,4,1,3,4,3,2,3,4
1131294,1,1,2,1,2,2,4,2,1,2
1132347,1,1,4,1,2,1,2,1,1,2
1133041,5,3,1,2,2,1,2,1,1,2
1133136,3,1,1,1,2,3,3,1,1,2
1136142,2,1,1,1,3,1,2,1,1,2
1137156,2,2,2,1,1,1,7,1,1,2
1143978,4,1,1,2,2,1,2,1,1,2
1143978,5,2,1,1,2,1,3,1,1,2
1147044,3,1,1,1,2,2,7,1,1,2
1147699,3,5,7,8,8,9,7,10,7,4
1147748,5,10,6,1,10,4,4,10,10,4
1148278,3,3,6,4,5,8,4,4,1,4
1148873,3,6,6,6,5,10,6,8,3,4
1152331,4,1,1,1,2,1,3,1,1,2
1155546,2,1,1,2,3,1,2,1,1,2
1156272,1,1,1,1,2,1,3,1,1,2
1156948,3,1,1,2,2,1,1,1,1,2
1157734,4,1,1,1,2,1,3,1,1,2
1158247,1,1,1,1,2,1,2,1,1,2
1160476,2,1,1,1,2,1,3,1,1,2
1164066,1,1,1,1,2,1,3,1,1,2
1165297,2,1,1,2,2,1,1,1,1,2
1165790,5,1,1,1,2,1,3,1,1,2
1165926,9,6,9,2,10,6,2,9,10,4
1166630,7,5,6,10,5,10,7,9,4,4
1166654,10,3,5,1,10,5,3,10,2,4
1167439,2,3,4,4,2,5,2,5,1,4
1167471,4,1,2,1,2,1,3,1,1,2
1168359,8,2,3,1,6,3,7,1,1,4
1168736,10,10,10,10,10,1,8,8,8,4
1169049,7,3,4,4,3,3,3,2,7,4
1170419,10,10,10,8,2,10,4,1,1,4
1170420,1,6,8,10,8,10,5,7,1,4
1171710,1,1,1,1,2,1,2,3,1,2
1171710,6,5,4,4,3,9,7,8,3,4
1171795,1,3,1,2,2,2,5,3,2,2
1171845,8,6,4,3,5,9,3,1,1,4
1172152,10,3,3,10,2,10,7,3,3,4
1173216,10,10,10,3,10,8,8,1,1,4
1173235,3,3,2,1,2,3,3,1,1,2
1173347,1,1,1,1,2,5,1,1,1,2
1173347,8,3,3,1,2,2,3,2,1,2
1173509,4,5,5,10,4,10,7,5,8,4
1173514,1,1,1,1,4,3,1,1,1,2
1173681,3,2,1,1,2,2,3,1,1,2
1174057,1,1,2,2,2,1,3,1,1,2
1174057,4,2,1,1,2,2,3,1,1,2
1174131,10,10,10,2,10,10,5,3,3,4
1174428,5,3,5,1,8,10,5,3,1,4
1175937,5,4,6,7,9,7,8,10,1,4
1176406,1,1,1,1,2,1,2,1,1,2
1176881,7,5,3,7,4,10,7,5,5,4
1177027,3,1,1,1,2,1,3,1,1,2
1177399,8,3,5,4,5,10,1,6,2,4
1177512,1,1,1,1,10,1,1,1,1,2
1178580,5,1,3,1,2,1,2,1,1,2
1179818,2,1,1,1,2,1,3,1,1,2
1180194,5,10,8,10,8,10,3,6,3,4
1180523,3,1,1,1,2,1,2,2,1,2
1180831,3,1,1,1,3,1,2,1,1,2
1181356,5,1,1,1,2,2,3,3,1,2
1182404,4,1,1,1,2,1,2,1,1,2
1182410,3,1,1,1,2,1,1,1,1,2
1183240,4,1,2,1,2,1,2,1,1,2
1183246,1,1,1,1,1,?,2,1,1,2
1183516,3,1,1,1,2,1,1,1,1,2
1183911,2,1,1,1,2,1,1,1,1,2
1183983,9,5,5,4,4,5,4,3,3,4
1184184,1,1,1,1,2,5,1,1,1,2
1184241,2,1,1,1,2,1,2,1,1,2
1184840,1,1,3,1,2,?,2,1,1,2
1185609,3,4,5,2,6,8,4,1,1,4
1185610,1,1,1,1,3,2,2,1,1,2
1187457,3,1,1,3,8,1,5,8,1,2
1187805,8,8,7,4,10,10,7,8,7,4
1188472,1,1,1,1,1,1,3,1,1,2
1189266,7,2,4,1,6,10,5,4,3,4
1189286,10,10,8,6,4,5,8,10,1,4
1190394,4,1,1,1,2,3,1,1,1,2
1190485,1,1,1,1,2,1,1,1,1,2
1192325,5,5,5,6,3,10,3,1,1,4
1193091,1,2,2,1,2,1,2,1,1,2
1193210,2,1,1,1,2,1,3,1,1,2
1193683,1,1,2,1,3,?,1,1,1,2
1196295,9,9,10,3,6,10,7,10,6,4
1196915,10,7,7,4,5,10,5,7,2,4
1197080,4,1,1,1,2,1,3,2,1,2
1197270,3,1,1,1,2,1,3,1,1,2
1197440,1,1,1,2,1,3,1,1,7,2
1197510,5,1,1,1,2,?,3,1,1,2
1197979,4,1,1,1,2,2,3,2,1,2
1197993,5,6,7,8,8,10,3,10,3,4
1198128,10,8,10,10,6,1,3,1,10,4
1198641,3,1,1,1,2,1,3,1,1,2
1199219,1,1,1,2,1,1,1,1,1,2
1199731,3,1,1,1,2,1,1,1,1,2
1199983,1,1,1,1,2,1,3,1,1,2
1200772,1,1,1,1,2,1,2,1,1,2
1200847,6,10,10,10,8,10,10,10,7,4
1200892,8,6,5,4,3,10,6,1,1,4
1200952,5,8,7,7,10,10,5,7,1,4
1201834,2,1,1,1,2,1,3,1,1,2
1201936,5,10,10,3,8,1,5,10,3,4
1202125,4,1,1,1,2,1,3,1,1,2
1202812,5,3,3,3,6,10,3,1,1,4
1203096,1,1,1,1,1,1,3,1,1,2
1204242,1,1,1,1,2,1,1,1,1,2
1204898,6,1,1,1,2,1,3,1,1,2
1205138,5,8,8,8,5,10,7,8,1,4
1205579,8,7,6,4,4,10,5,1,1,4
1206089,2,1,1,1,1,1,3,1,1,2
1206695,1,5,8,6,5,8,7,10,1,4
1206841,10,5,6,10,6,10,7,7,10,4
1207986,5,8,4,10,5,8,9,10,1,4
1208301,1,2,3,1,2,1,3,1,1,2
1210963,10,10,10,8,6,8,7,10,1,4
1211202,7,5,10,10,10,10,4,10,3,4
1212232,5,1,1,1,2,1,2,1,1,2
1212251,1,1,1,1,2,1,3,1,1,2
1212422,3,1,1,1,2,1,3,1,1,2
1212422,4,1,1,1,2,1,3,1,1,2
1213375,8,4,4,5,4,7,7,8,2,2
1213383,5,1,1,4,2,1,3,1,1,2
1214092,1,1,1,1,2,1,1,1,1,2
1214556,3,1,1,1,2,1,2,1,1,2
1214966,9,7,7,5,5,10,7,8,3,4
1216694,10,8,8,4,10,10,8,1,1,4
1216947,1,1,1,1,2,1,3,1,1,2
1217051,5,1,1,1,2,1,3,1,1,2
1217264,1,1,1,1,2,1,3,1,1,2
1218105,5,10,10,9,6,10,7,10,5,4
1218741,10,10,9,3,7,5,3,5,1,4
1218860,1,1,1,1,1,1,3,1,1,2
1218860,1,1,1,1,1,1,3,1,1,2
1219406,5,1,1,1,1,1,3,1,1,2
1219525,8,10,10,10,5,10,8,10,6,4
1219859,8,10,8,8,4,8,7,7,1,4
1220330,1,1,1,1,2,1,3,1,1,2
1221863,10,10,10,10,7,10,7,10,4,4
1222047,10,10,10,10,3,10,10,6,1,4
1222936,8,7,8,7,5,5,5,10,2,4
1223282,1,1,1,1,2,1,2,1,1,2
1223426,1,1,1,1,2,1,3,1,1,2
1223793,6,10,7,7,6,4,8,10,2,4
1223967,6,1,3,1,2,1,3,1,1,2
1224329,1,1,1,2,2,1,3,1,1,2
1225799,10,6,4,3,10,10,9,10,1,4
1226012,4,1,1,3,1,5,2,1,1,4
1226612,7,5,6,3,3,8,7,4,1,4
1227210,10,5,5,6,3,10,7,9,2,4
1227244,1,1,1,1,2,1,2,1,1,2
1227481,10,5,7,4,4,10,8,9,1,4
1228152,8,9,9,5,3,5,7,7,1,4
1228311,1,1,1,1,1,1,3,1,1,2
1230175,10,10,10,3,10,10,9,10,1,4
1230688,7,4,7,4,3,7,7,6,1,4
1231387,6,8,7,5,6,8,8,9,2,4
1231706,8,4,6,3,3,1,4,3,1,2
1232225,10,4,5,5,5,10,4,1,1,4
1236043,3,3,2,1,3,1,3,6,1,2
1241232,3,1,4,1,2,?,3,1,1,2
1241559,10,8,8,2,8,10,4,8,10,4
1241679,9,8,8,5,6,2,4,10,4,4
1242364,8,10,10,8,6,9,3,10,10,4
1243256,10,4,3,2,3,10,5,3,2,4
1270479,5,1,3,3,2,2,2,3,1,2
1276091,3,1,1,3,1,1,3,1,1,2
1277018,2,1,1,1,2,1,3,1,1,2
128059,1,1,1,1,2,5,5,1,1,2
1285531,1,1,1,1,2,1,3,1,1,2
1287775,5,1,1,2,2,2,3,1,1,2
144888,8,10,10,8,5,10,7,8,1,4
145447,8,4,4,1,2,9,3,3,1,4
167528,4,1,1,1,2,1,3,6,1,2
169356,3,1,1,1,2,?,3,1,1,2
183913,1,2,2,1,2,1,1,1,1,2
191250,10,4,4,10,2,10,5,3,3,4
1017023,6,3,3,5,3,10,3,5,3,2
1100524,6,10,10,2,8,10,7,3,3,4
1116116,9,10,10,1,10,8,3,3,1,4
1168736,5,6,6,2,4,10,3,6,1,4
1182404,3,1,1,1,2,1,1,1,1,2
1182404,3,1,1,1,2,1,2,1,1,2
1198641,3,1,1,1,2,1,3,1,1,2
242970,5,7,7,1,5,8,3,4,1,2
255644,10,5,8,10,3,10,5,1,3,4
263538,5,10,10,6,10,10,10,6,5,4
274137,8,8,9,4,5,10,7,8,1,4
303213,10,4,4,10,6,10,5,5,1,4
314428,7,9,4,10,10,3,5,3,3,4
1182404,5,1,4,1,2,1,3,2,1,2
1198641,10,10,6,3,3,10,4,3,2,4
320675,3,3,5,2,3,10,7,1,1,4
324427,10,8,8,2,3,4,8,7,8,4
385103,1,1,1,1,2,1,3,1,1,2
390840,8,4,7,1,3,10,3,9,2,4
411453,5,1,1,1,2,1,3,1,1,2
320675,3,3,5,2,3,10,7,1,1,4
428903,7,2,4,1,3,4,3,3,1,4
431495,3,1,1,1,2,1,3,2,1,2
432809,3,1,3,1,2,?,2,1,1,2
434518,3,1,1,1,2,1,2,1,1,2
452264,1,1,1,1,2,1,2,1,1,2
456282,1,1,1,1,2,1,3,1,1,2
476903,10,5,7,3,3,7,3,3,8,4
486283,3,1,1,1,2,1,3,1,1,2
486662,2,1,1,2,2,1,3,1,1,2
488173,1,4,3,10,4,10,5,6,1,4
492268,10,4,6,1,2,10,5,3,1,4
508234,7,4,5,10,2,10,3,8,2,4
527363,8,10,10,10,8,10,10,7,3,4
529329,10,10,10,10,10,10,4,10,10,4
535331,3,1,1,1,3,1,2,1,1,2
543558,6,1,3,1,4,5,5,10,1,4
555977,5,6,6,8,6,10,4,10,4,4
560680,1,1,1,1,2,1,1,1,1,2
561477,1,1,1,1,2,1,3,1,1,2
563649,8,8,8,1,2,?,6,10,1,4
601265,10,4,4,6,2,10,2,3,1,4
606140,1,1,1,1,2,?,2,1,1,2
606722,5,5,7,8,6,10,7,4,1,4
616240,5,3,4,3,4,5,4,7,1,2
61634,5,4,3,1,2,?,2,3,1,2
625201,8,2,1,1,5,1,1,1,1,2
63375,9,1,2,6,4,10,7,7,2,4
635844,8,4,10,5,4,4,7,10,1,4
636130,1,1,1,1,2,1,3,1,1,2
640744,10,10,10,7,9,10,7,10,10,4
646904,1,1,1,1,2,1,3,1,1,2
653777,8,3,4,9,3,10,3,3,1,4
659642,10,8,4,4,4,10,3,10,4,4
666090,1,1,1,1,2,1,3,1,1,2
666942,1,1,1,1,2,1,3,1,1,2
667204,7,8,7,6,4,3,8,8,4,4
673637,3,1,1,1,2,5,5,1,1,2
684955,2,1,1,1,3,1,2,1,1,2
688033,1,1,1,1,2,1,1,1,1,2
691628,8,6,4,10,10,1,3,5,1,4
693702,1,1,1,1,2,1,1,1,1,2
704097,1,1,1,1,1,1,2,1,1,2
704168,4,6,5,6,7,?,4,9,1,2
706426,5,5,5,2,5,10,4,3,1,4
709287,6,8,7,8,6,8,8,9,1,4
718641,1,1,1,1,5,1,3,1,1,2
721482,4,4,4,4,6,5,7,3,1,2
730881,7,6,3,2,5,10,7,4,6,4
733639,3,1,1,1,2,?,3,1,1,2
733639,3,1,1,1,2,1,3,1,1,2
733823,5,4,6,10,2,10,4,1,1,4
740492,1,1,1,1,2,1,3,1,1,2
743348,3,2,2,1,2,1,2,3,1,2
752904,10,1,1,1,2,10,5,4,1,4
756136,1,1,1,1,2,1,2,1,1,2
760001,8,10,3,2,6,4,3,10,1,4
760239,10,4,6,4,5,10,7,1,1,4
76389,10,4,7,2,2,8,6,1,1,4
764974,5,1,1,1,2,1,3,1,2,2
770066,5,2,2,2,2,1,2,2,1,2
785208,5,4,6,6,4,10,4,3,1,4
785615,8,6,7,3,3,10,3,4,2,4
792744,1,1,1,1,2,1,1,1,1,2
797327,6,5,5,8,4,10,3,4,1,4
798429,1,1,1,1,2,1,3,1,1,2
704097,1,1,1,1,1,1,2,1,1,2
806423,8,5,5,5,2,10,4,3,1,4
809912,10,3,3,1,2,10,7,6,1,4
810104,1,1,1,1,2,1,3,1,1,2
814265,2,1,1,1,2,1,1,1,1,2
814911,1,1,1,1,2,1,1,1,1,2
822829,7,6,4,8,10,10,9,5,3,4
826923,1,1,1,1,2,1,1,1,1,2
830690,5,2,2,2,3,1,1,3,1,2
831268,1,1,1,1,1,1,1,3,1,2
832226,3,4,4,10,5,1,3,3,1,4
832567,4,2,3,5,3,8,7,6,1,4
836433,5,1,1,3,2,1,1,1,1,2
837082,2,1,1,1,2,1,3,1,1,2
846832,3,4,5,3,7,3,4,6,1,2
850831,2,7,10,10,7,10,4,9,4,4
855524,1,1,1,1,2,1,2,1,1,2
857774,4,1,1,1,3,1,2,2,1,2
859164,5,3,3,1,3,3,3,3,3,4
859350,8,10,10,7,10,10,7,3,8,4
866325,8,10,5,3,8,4,4,10,3,4
873549,10,3,5,4,3,7,3,5,3,4
877291,6,10,10,10,10,10,8,10,10,4
877943,3,10,3,10,6,10,5,1,4,4
888169,3,2,2,1,4,3,2,1,1,2
888523,4,4,4,2,2,3,2,1,1,2
896404,2,1,1,1,2,1,3,1,1,2
897172,2,1,1,1,2,1,2,1,1,2
95719,6,10,10,10,8,10,7,10,7,4
160296,5,8,8,10,5,10,8,10,3,4
342245,1,1,3,1,2,1,1,1,1,2
428598,1,1,3,1,1,1,2,1,1,2
492561,4,3,2,1,3,1,2,1,1,2
493452,1,1,3,1,2,1,1,1,1,2
493452,4,1,2,1,2,1,2,1,1,2
521441,5,1,1,2,2,1,2,1,1,2
560680,3,1,2,1,2,1,2,1,1,2
636437,1,1,1,1,2,1,1,1,1,2
640712,1,1,1,1,2,1,2,1,1,2
654244,1,1,1,1,1,1,2,1,1,2
657753,3,1,1,4,3,1,2,2,1,2
685977,5,3,4,1,4,1,3,1,1,2
805448,1,1,1,1,2,1,1,1,1,2
846423,10,6,3,6,4,10,7,8,4,4
1002504,3,2,2,2,2,1,3,2,1,2
1022257,2,1,1,1,2,1,1,1,1,2
1026122,2,1,1,1,2,1,1,1,1,2
1071084,3,3,2,2,3,1,1,2,3,2
1080233,7,6,6,3,2,10,7,1,1,4
1114570,5,3,3,2,3,1,3,1,1,2
1114570,2,1,1,1,2,1,2,2,1,2
1116715,5,1,1,1,3,2,2,2,1,2
1131411,1,1,1,2,2,1,2,1,1,2
1151734,10,8,7,4,3,10,7,9,1,4
1156017,3,1,1,1,2,1,2,1,1,2
1158247,1,1,1,1,1,1,1,1,1,2
1158405,1,2,3,1,2,1,2,1,1,2
1168278,3,1,1,1,2,1,2,1,1,2
1176187,3,1,1,1,2,1,3,1,1,2
1196263,4,1,1,1,2,1,1,1,1,2
1196475,3,2,1,1,2,1,2,2,1,2
1206314,1,2,3,1,2,1,1,1,1,2
1211265,3,10,8,7,6,9,9,3,8,4
1213784,3,1,1,1,2,1,1,1,1,2
1223003,5,3,3,1,2,1,2,1,1,2
1223306,3,1,1,1,2,4,1,1,1,2
1223543,1,2,1,3,2,1,1,2,1,2
1229929,1,1,1,1,2,1,2,1,1,2
1231853,4,2,2,1,2,1,2,1,1,2
1234554,1,1,1,1,2,1,2,1,1,2
1236837,2,3,2,2,2,2,3,1,1,2
1237674,3,1,2,1,2,1,2,1,1,2
1238021,1,1,1,1,2,1,2,1,1,2
1238464,1,1,1,1,1,?,2,1,1,2
1238633,10,10,10,6,8,4,8,5,1,4
1238915,5,1,2,1,2,1,3,1,1,2
1238948,8,5,6,2,3,10,6,6,1,4
1239232,3,3,2,6,3,3,3,5,1,2
1239347,8,7,8,5,10,10,7,2,1,4
1239967,1,1,1,1,2,1,2,1,1,2
1240337,5,2,2,2,2,2,3,2,2,2
1253505,2,3,1,1,5,1,1,1,1,2
1255384,3,2,2,3,2,3,3,1,1,2
1257200,10,10,10,7,10,10,8,2,1,4
1257648,4,3,3,1,2,1,3,3,1,2
1257815,5,1,3,1,2,1,2,1,1,2
1257938,3,1,1,1,2,1,1,1,1,2
1258549,9,10,10,10,10,10,10,10,1,4
1258556,5,3,6,1,2,1,1,1,1,2
1266154,8,7,8,2,4,2,5,10,1,4
1272039,1,1,1,1,2,1,2,1,1,2
1276091,2,1,1,1,2,1,2,1,1,2
1276091,1,3,1,1,2,1,2,2,1,2
1276091,5,1,1,3,4,1,3,2,1,2
1277629,5,1,1,1,2,1,2,2,1,2
1293439,3,2,2,3,2,1,1,1,1,2
1293439,6,9,7,5,5,8,4,2,1,2
1294562,10,8,10,1,3,10,5,1,1,4
1295186,10,10,10,1,6,1,2,8,1,4
527337,4,1,1,1,2,1,1,1,1,2
558538,4,1,3,3,2,1,1,1,1,2
566509,5,1,1,1,2,1,1,1,1,2
608157,10,4,3,10,4,10,10,1,1,4
677910,5,2,2,4,2,4,1,1,1,2
734111,1,1,1,3,2,3,1,1,1,2
734111,1,1,1,1,2,2,1,1,1,2
780555,5,1,1,6,3,1,2,1,1,2
827627,2,1,1,1,2,1,1,1,1,2
1049837,1,1,1,1,2,1,1,1,1,2
1058849,5,1,1,1,2,1,1,1,1,2
1182404,1,1,1,1,1,1,1,1,1,2
1193544,5,7,9,8,6,10,8,10,1,4
1201870,4,1,1,3,1,1,2,1,1,2
1202253,5,1,1,1,2,1,1,1,1,2
1227081,3,1,1,3,2,1,1,1,1,2
1230994,4,5,5,8,6,10,10,7,1,4
1238410,2,3,1,1,3,1,1,1,1,2
1246562,10,2,2,1,2,6,1,1,2,4
1257470,10,6,5,8,5,10,8,6,1,4
1259008,8,8,9,6,6,3,10,10,1,4
1266124,5,1,2,1,2,1,1,1,1,2
1267898,5,1,3,1,2,1,1,1,1,2
1268313,5,1,1,3,2,1,1,1,1,2
1268804,3,1,1,1,2,5,1,1,1,2
1276091,6,1,1,3,2,1,1,1,1,2
1280258,4,1,1,1,2,1,1,2,1,2
1293966,4,1,1,1,2,1,1,1,1,2
1296572,10,9,8,7,6,4,7,10,3,4
1298416,10,6,6,2,4,10,9,7,1,4
1299596,6,6,6,5,4,10,7,6,2,4
1105524,4,1,1,1,2,1,1,1,1,2
1181685,1,1,2,1,2,1,2,1,1,2
1211594,3,1,1,1,1,1,2,1,1,2
1238777,6,1,1,3,2,1,1,1,1,2
1257608,6,1,1,1,1,1,1,1,1,2
1269574,4,1,1,1,2,1,1,1,1,2
1277145,5,1,1,1,2,1,1,1,1,2
1287282,3,1,1,1,2,1,1,1,1,2
1296025,4,1,2,1,2,1,1,1,1,2
1296263,4,1,1,1,2,1,1,1,1,2
1296593,5,2,1,1,2,1,1,1,1,2
1299161,4,8,7,10,4,10,7,5,1,4
1301945,5,1,1,1,1,1,1,1,1,2
1302428,5,3,2,4,2,1,1,1,1,2
1318169,9,10,10,10,10,5,10,10,10,4
474162,8,7,8,5,5,10,9,10,1,4
787451,5,1,2,1,2,1,1,1,1,2
1002025,1,1,1,3,1,3,1,1,1,2
1070522,3,1,1,1,1,1,2,1,1,2
1073960,10,10,10,10,6,10,8,1,5,4
1076352,3,6,4,10,3,3,3,4,1,4
1084139,6,3,2,1,3,4,4,1,1,4
1115293,1,1,1,1,2,1,1,1,1,2
1119189,5,8,9,4,3,10,7,1,1,4
1133991,4,1,1,1,1,1,2,1,1,2
1142706,5,10,10,10,6,10,6,5,2,4
1155967,5,1,2,10,4,5,2,1,1,2
1170945,3,1,1,1,1,1,2,1,1,2
1181567,1,1,1,1,1,1,1,1,1,2
1182404,4,2,1,1,2,1,1,1,1,2
1204558,4,1,1,1,2,1,2,1,1,2
1217952,4,1,1,1,2,1,2,1,1,2
1224565,6,1,1,1,2,1,3,1,1,2
1238186,4,1,1,1,2,1,2,1,1,2
1253917,4,1,1,2,2,1,2,1,1,2
1265899,4,1,1,1,2,1,3,1,1,2
1268766,1,1,1,1,2,1,1,1,1,2
1277268,3,3,1,1,2,1,1,1,1,2
1286943,8,10,10,10,7,5,4,8,7,4
1295508,1,1,1,1,2,4,1,1,1,2
1297327,5,1,1,1,2,1,1,1,1,2
1297522,2,1,1,1,2,1,1,1,1,2
1298360,1,1,1,1,2,1,1,1,1,2
1299924,5,1,1,1,2,1,2,1,1,2
1299994,5,1,1,1,2,1,1,1,1,2
1304595,3,1,1,1,1,1,2,1,1,2
1306282,6,6,7,10,3,10,8,10,2,4
1313325,4,10,4,7,3,10,9,10,1,4
1320077,1,1,1,1,1,1,1,1,1,2
1320077,1,1,1,1,1,1,2,1,1,2
1320304,3,1,2,2,2,1,1,1,1,2
1330439,4,7,8,3,4,10,9,1,1,4
333093,1,1,1,1,3,1,1,1,1,2
369565,4,1,1,1,3,1,1,1,1,2
412300,10,4,5,4,3,5,7,3,1,4
672113,7,5,6,10,4,10,5,3,1,4
749653,3,1,1,1,2,1,2,1,1,2
769612,3,1,1,2,2,1,1,1,1,2
769612,4,1,1,1,2,1,1,1,1,2
798429,4,1,1,1,2,1,3,1,1,2
807657,6,1,3,2,2,1,1,1,1,2
8233704,4,1,1,1,1,1,2,1,1,2
837480,7,4,4,3,4,10,6,9,1,4
867392,4,2,2,1,2,1,2,1,1,2
869828,1,1,1,1,1,1,3,1,1,2
1043068,3,1,1,1,2,1,2,1,1,2
1056171,2,1,1,1,2,1,2,1,1,2
1061990,1,1,3,2,2,1,3,1,1,2
1113061,5,1,1,1,2,1,3,1,1,2
1116192,5,1,2,1,2,1,3,1,1,2
1135090,4,1,1,1,2,1,2,1,1,2
1145420,6,1,1,1,2,1,2,1,1,2
1158157,5,1,1,1,2,2,2,1,1,2
1171578,3,1,1,1,2,1,1,1,1,2
1174841,5,3,1,1,2,1,1,1,1,2
1184586,4,1,1,1,2,1,2,1,1,2
1186936,2,1,3,2,2,1,2,1,1,2
1197527,5,1,1,1,2,1,2,1,1,2
1222464,6,10,10,10,4,10,7,10,1,4
1240603,2,1,1,1,1,1,1,1,1,2
1240603,3,1,1,1,1,1,1,1,1,2
1241035,7,8,3,7,4,5,7,8,2,4
1287971,3,1,1,1,2,1,2,1,1,2
1289391,1,1,1,1,2,1,3,1,1,2
1299924,3,2,2,2,2,1,4,2,1,2
1306339,4,4,2,1,2,5,2,1,2,2
1313658,3,1,1,1,2,1,1,1,1,2
1313982,4,3,1,1,2,1,4,8,1,2
1321264,5,2,2,2,1,1,2,1,1,2
1321321,5,1,1,3,2,1,1,1,1,2
1321348,2,1,1,1,2,1,2,1,1,2
1321931,5,1,1,1,2,1,2,1,1,2
1321942,5,1,1,1,2,1,3,1,1,2
1321942,5,1,1,1,2,1,3,1,1,2
1328331,1,1,1,1,2,1,3,1,1,2
1328755,3,1,1,1,2,1,2,1,1,2
1331405,4,1,1,1,2,1,3,2,1,2
1331412,5,7,10,10,5,10,10,10,1,4
1333104,3,1,2,1,2,1,3,1,1,2
1334071,4,1,1,1,2,3,2,1,1,2
1343068,8,4,4,1,6,10,2,5,2,4
1343374,10,10,8,10,6,5,10,3,1,4
1344121,8,10,4,4,8,10,8,2,1,4
142932,7,6,10,5,3,10,9,10,2,4
183936,3,1,1,1,2,1,2,1,1,2
324382,1,1,1,1,2,1,2,1,1,2
378275,10,9,7,3,4,2,7,7,1,4
385103,5,1,2,1,2,1,3,1,1,2
690557,5,1,1,1,2,1,2,1,1,2
695091,1,1,1,1,2,1,2,1,1,2
695219,1,1,1,1,2,1,2,1,1,2
824249,1,1,1,1,2,1,3,1,1,2
871549,5,1,2,1,2,1,2,1,1,2
878358,5,7,10,6,5,10,7,5,1,4
1107684,6,10,5,5,4,10,6,10,1,4
1115762,3,1,1,1,2,1,1,1,1,2
1217717,5,1,1,6,3,1,1,1,1,2
1239420,1,1,1,1,2,1,1,1,1,2
1254538,8,10,10,10,6,10,10,10,1,4
1261751,5,1,1,1,2,1,2,2,1,2
1268275,9,8,8,9,6,3,4,1,1,4
1272166,5,1,1,1,2,1,1,1,1,2
1294261,4,10,8,5,4,1,10,1,1,4
1295529,2,5,7,6,4,10,7,6,1,4
1298484,10,3,4,5,3,10,4,1,1,4
1311875,5,1,2,1,2,1,1,1,1,2
1315506,4,8,6,3,4,10,7,1,1,4
1320141,5,1,1,1,2,1,2,1,1,2
1325309,4,1,2,1,2,1,2,1,1,2
1333063,5,1,3,1,2,1,3,1,1,2
1333495,3,1,1,1,2,1,2,1,1,2
1334659,5,2,4,1,1,1,1,1,1,2
1336798,3,1,1,1,2,1,2,1,1,2
1344449,1,1,1,1,1,1,2,1,1,2
1350568,4,1,1,1,2,1,2,1,1,2
1352663,5,4,6,8,4,1,8,10,1,4
188336,5,3,2,8,5,10,8,1,2,4
352431,10,5,10,3,5,8,7,8,3,4
353098,4,1,1,2,2,1,1,1,1,2
411453,1,1,1,1,2,1,1,1,1,2
557583,5,10,10,10,10,10,10,1,1,4
636375,5,1,1,1,2,1,1,1,1,2
736150,10,4,3,10,3,10,7,1,2,4
803531,5,10,10,10,5,2,8,5,1,4
822829,8,10,10,10,6,10,10,10,10,4
1016634,2,3,1,1,2,1,2,1,1,2
1031608,2,1,1,1,1,1,2,1,1,2
1041043,4,1,3,1,2,1,2,1,1,2
1042252,3,1,1,1,2,1,2,1,1,2
1057067,1,1,1,1,1,?,1,1,1,2
1061990,4,1,1,1,2,1,2,1,1,2
1073836,5,1,1,1,2,1,2,1,1,2
1083817,3,1,1,1,2,1,2,1,1,2
1096352,6,3,3,3,3,2,6,1,1,2
1140597,7,1,2,3,2,1,2,1,1,2
1149548,1,1,1,1,2,1,1,1,1,2
1174009,5,1,1,2,1,1,2,1,1,2
1183596,3,1,3,1,3,4,1,1,1,2
1190386,4,6,6,5,7,6,7,7,3,4
1190546,2,1,1,1,2,5,1,1,1,2
1213273,2,1,1,1,2,1,1,1,1,2
1218982,4,1,1,1,2,1,1,1,1,2
1225382,6,2,3,1,2,1,1,1,1,2
1235807,5,1,1,1,2,1,2,1,1,2
1238777,1,1,1,1,2,1,1,1,1,2
1253955,8,7,4,4,5,3,5,10,1,4
1257366,3,1,1,1,2,1,1,1,1,2
1260659,3,1,4,1,2,1,1,1,1,2
1268952,10,10,7,8,7,1,10,10,3,4
1275807,4,2,4,3,2,2,2,1,1,2
1277792,4,1,1,1,2,1,1,1,1,2
1277792,5,1,1,3,2,1,1,1,1,2
1285722,4,1,1,3,2,1,1,1,1,2
1288608,3,1,1,1,2,1,2,1,1,2
1290203,3,1,1,1,2,1,2,1,1,2
1294413,1,1,1,1,2,1,1,1,1,2
1299596,2,1,1,1,2,1,1,1,1,2
1303489,3,1,1,1,2,1,2,1,1,2
1311033,1,2,2,1,2,1,1,1,1,2
1311108,1,1,1,3,2,1,1,1,1,2
1315807,5,10,10,10,10,2,10,10,10,4
1318671,3,1,1,1,2,1,2,1,1,2
1319609,3,1,1,2,3,4,1,1,1,2
1323477,1,2,1,3,2,1,2,1,1,2
1324572,5,1,1,1,2,1,2,2,1,2
1324681,4,1,1,1,2,1,2,1,1,2
1325159,3,1,1,1,2,1,3,1,1,2
1326892,3,1,1,1,2,1,2,1,1,2
1330361,5,1,1,1,2,1,2,1,1,2
1333877,5,4,5,1,8,1,3,6,1,2
1334015,7,8,8,7,3,10,7,2,3,4
1334667,1,1,1,1,2,1,1,1,1,2
1339781,1,1,1,1,2,1,2,1,1,2
1339781,4,1,1,1,2,1,3,1,1,2
13454352,1,1,3,1,2,1,2,1,1,2
1345452,1,1,3,1,2,1,2,1,1,2
1345593,3,1,1,3,2,1,2,1,1,2
1347749,1,1,1,1,2,1,1,1,1,2
1347943,5,2,2,2,2,1,1,1,2,2
1348851,3,1,1,1,2,1,3,1,1,2
1350319,5,7,4,1,6,1,7,10,3,4
1350423,5,10,10,8,5,5,7,10,1,4
1352848,3,10,7,8,5,8,7,4,1,4
1353092,3,2,1,2,2,1,3,1,1,2
1354840,2,1,1,1,2,1,3,1,1,2
1354840,5,3,2,1,3,1,1,1,1,2
1355260,1,1,1,1,2,1,2,1,1,2
1365075,4,1,4,1,2,1,1,1,1,2
1365328,1,1,2,1,2,1,2,1,1,2
1368267,5,1,1,1,2,1,1,1,1,2
1368273,1,1,1,1,2,1,1,1,1,2
1368882,2,1,1,1,2,1,1,1,1,2
1369821,10,10,10,10,5,10,10,10,7,4
1371026,5,10,10,10,4,10,5,6,3,4
1371920,5,1,1,1,2,1,3,2,1,2
466906,1,1,1,1,2,1,1,1,1,2
466906,1,1,1,1,2,1,1,1,1,2
534555,1,1,1,1,2,1,1,1,1,2
536708,1,1,1,1,2,1,1,1,1,2
566346,3,1,1,1,2,1,2,3,1,2
603148,4,1,1,1,2,1,1,1,1,2
654546,1,1,1,1,2,1,1,1,8,2
654546,1,1,1,3,2,1,1,1,1,2
695091,5,10,10,5,4,5,4,4,1,4
714039,3,1,1,1,2,1,1,1,1,2
763235,3,1,1,1,2,1,2,1,2,2
776715,3,1,1,1,3,2,1,1,1,2
841769,2,1,1,1,2,1,1,1,1,2
888820,5,10,10,3,7,3,8,10,2,4
897471,4,8,6,4,3,4,10,6,1,4
897471,4,8,8,5,4,5,10,4,1,4

126
subjects/ai/classification-with-scikit-learn /data/breast-cancer-wisconsin.names

@ -0,0 +1,126 @@
Citation Request:
This breast cancer database was obtained from the University of Wisconsin
Hospitals, Madison from Dr. William H. Wolberg. If you publish results
when using this database, then please include this information in your
acknowledgements. Also, please cite one or more of:
1. O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear
programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
2. William H. Wolberg and O.L. Mangasarian: "Multisurface method of
pattern separation for medical diagnosis applied to breast cytology",
Proceedings of the National Academy of Sciences, U.S.A., Volume 87,
December 1990, pp 9193-9196.
3. O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition
via linear programming: Theory and application to medical diagnosis",
in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
4. K. P. Bennett & O. L. Mangasarian: "Robust linear programming
discrimination of two linearly inseparable sets", Optimization Methods
and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).
1. Title: Wisconsin Breast Cancer Database (January 8, 1991)
2. Sources:
-- Dr. William H. Wolberg (physician)
University of Wisconsin Hospitals
Madison, Wisconsin
USA
-- Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)
Received by David W. Aha (aha@cs.jhu.edu)
-- Date: 15 July 1992
3. Past Usage:
Attributes 2 through 10 have been used to represent instances.
Each instance has one of 2 possible classes: benign or malignant.
1. Wolberg,~W.~H., \& Mangasarian,~O.~L. (1990). Multisurface method of
pattern separation for medical diagnosis applied to breast cytology. In
{\it Proceedings of the National Academy of Sciences}, {\it 87},
9193--9196.
-- Size of data set: only 369 instances (at that point in time)
-- Collected classification results: 1 trial only
-- Two pairs of parallel hyperplanes were found to be consistent with
50% of the data
-- Accuracy on remaining 50% of dataset: 93.5%
-- Three pairs of parallel hyperplanes were found to be consistent with
67% of data
-- Accuracy on remaining 33% of dataset: 95.9%
2. Zhang,~J. (1992). Selecting typical instances in instance-based
learning. In {\it Proceedings of the Ninth International Machine
Learning Conference} (pp. 470--479). Aberdeen, Scotland: Morgan
Kaufmann.
-- Size of data set: only 369 instances (at that point in time)
-- Applied 4 instance-based learning algorithms
-- Collected classification results averaged over 10 trials
-- Best accuracy result:
-- 1-nearest neighbor: 93.7%
-- trained on 200 instances, tested on the other 169
-- Also of interest:
-- Using only typical instances: 92.2% (storing only 23.1 instances)
-- trained on 200 instances, tested on the other 169
4. Relevant Information:
Samples arrive periodically as Dr. Wolberg reports his clinical cases.
The database therefore reflects this chronological grouping of the data.
This grouping information appears immediately below, having been removed
from the data itself:
Group 1: 367 instances (January 1989)
Group 2: 70 instances (October 1989)
Group 3: 31 instances (February 1990)
Group 4: 17 instances (April 1990)
Group 5: 48 instances (August 1990)
Group 6: 49 instances (Updated January 1991)
Group 7: 31 instances (June 1991)
Group 8: 86 instances (November 1991)
-----------------------------------------
Total: 699 points (as of the donated database on 15 July 1992)
Note that the results summarized above in Past Usage refer to a dataset
of size 369, while Group 1 has only 367 instances. This is because it
originally contained 369 instances; 2 were removed. The following
statements summarize changes to the original Group 1's set of data:
##### Group 1 : 367 points: 200B 167M (January 1989)
##### Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
##### Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
##### : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
##### : Changed 0 to 1 in field 6 of sample 1219406
##### : Changed 0 to 1 in field 8 of following sample:
##### : 1182404,2,3,1,1,1,2,0,1,1,1
5. Number of Instances: 699 (as of 15 July 1992)
6. Number of Attributes: 10 plus the class attribute
7. Attribute Information: (class attribute has been moved to last column)
# Attribute Domain
-- -----------------------------------------
1. Sample code number id number
2. Clump Thickness 1 - 10
3. Uniformity of Cell Size 1 - 10
4. Uniformity of Cell Shape 1 - 10
5. Marginal Adhesion 1 - 10
6. Single Epithelial Cell Size 1 - 10
7. Bare Nuclei 1 - 10
8. Bland Chromatin 1 - 10
9. Normal Nucleoli 1 - 10
10. Mitoses 1 - 10
11. Class: (2 for benign, 4 for malignant)
8. Missing attribute values: 16
There are 16 instances in Groups 1 to 6 that contain a single missing
(i.e., unavailable) attribute value, now denoted by "?".
9. Class distribution:
Benign: 458 (65.5%)
Malignant: 241 (34.5%)

BIN
subjects/ai/classification-with-scikit-learn /w2_day2_ex2_q1.png

Size: 53 KiB

BIN
subjects/ai/classification-with-scikit-learn /w2_day2_ex3_q1.png

Size: 30 KiB

BIN
subjects/ai/classification-with-scikit-learn /w2_day2_ex3_q3.png

Size: 47 KiB

BIN
subjects/ai/classification-with-scikit-learn /w2_day2_ex3_q5.png

Size: 50 KiB

BIN
subjects/ai/classification-with-scikit-learn /w2_day2_ex3_q6.png

Size: 121 KiB

117
subjects/ai/credit-scoring/README.md

@ -0,0 +1,117 @@
# Credit scoring
The goal of this project is to implement a scoring model based on various sources of data (check the data documentation) that returns the probability of default. In a nutshell, credit scoring represents an evaluation of how well the bank's customer can and is willing to pay off debt. You are also required to provide an explanation of the score. For example, your model returns that the probability that a client doesn't pay back the loan is very high (90%). The reason behind this is that variable_xxx, which represents the ability to pay back past loans, is low. The output interpretability will appear in a visualization.
The ability to understand the underlying factors of credit scoring is important. Credit scoring is subject to more and more regulation, so transparency is key. More generally, more and more companies prefer transparency to black-box models.
### Resources
Historical timeline of machine learning techniques applied to credit scoring
- https://hal.archives-ouvertes.fr/hal-02507499v3/document
- https://www.kaggle.com/c/home-credit-default-risk/data
# Deliverables
### Scoring model
There are 3 expected deliverables associated with the scoring model:
- An exploratory data analysis notebook that describes the insights you find out in the data set.
- The trained machine learning model with the features engineering pipeline:
- Do not forget: **Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.**
- The model is validated if the **AUC on the test set is higher than 75%** (a minimal evaluation sketch follows this list).
- The labelled test data is not publicly available. However, a Kaggle competition uses the same data. The procedure to evaluate the test set submission is the same as the one used for project 1.
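A minimal evaluation sketch, on synthetic stand-in data, of how the test-set AUC can be computed; the model choice and the generated data are placeholders, not the required pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the credit data (your features come from the
# feature engineering pipeline, not from make_classification).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class (default)
print(f"AUC on test set: {roc_auc_score(y_test, proba):.2f}")
```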
### Kaggle submission
The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest this resource that gives detailed explanations.
- https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18
- Create a username following that structure: username*01EDU* location_MM_YYYY. Submit the description profile and push it on the Git platform the first day of the week. Do not touch this file anymore.
- A text document that describes the methodology used to train the machine learning model:
- Algorithm
- Why shouldn't accuracy be used in this case?
- Limits and possible improvements
### Model interpretability
This part hasn't been covered during the piscine. Take the time to understand this key concept.
There are different levels of transparency:
- **Global**: understand the important variables in a model. This answers the question: "What are the key variables to the model?". In that case it will tell, for example, whether the revenue is more important to the model than the age. This allows you to check that the model relies on important variables. No one wants their credit to be refused because of the weather in Lisbon!
- **Local**: each observation gets its own set of interpretability factors. This greatly increases its transparency. We can explain why a case receives its prediction and the contributions of the predictors. Traditional variable importance algorithms only show the results across the entire population but not on each individual case. The local interpretability enables us to pinpoint and contrast the impacts of the factors.
There are 2 tools you can use to analyse your model and its predictions: feature importance (available if you use a Scikit-Learn model) and the [SHAP library](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d).
Implement a program that takes as input the trained model, the customer id ... and returns (a minimal sketch is given after this list):
- the score and the SHAP force plot associated with it
- Plotly visualisations that show:
- key variables describing the client and its loan(s)
- comparison between this client and other clients
Choose the 3 clients of your choice, compute the score, run the visualizations on their data and save them.
- Take 2 clients from the train set:
- 1 on which the model is correct and 1 on which the model is wrong. Try to understand why the model got it wrong for this client.
- Take 1 client from the test set
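A minimal sketch of such a program, assuming a tree-based scikit-learn model and a feature matrix `X` indexed by customer id; the function name, the output file name and the exact SHAP calls (which depend on the model type) are assumptions, not requirements:

```python
import shap  # pip install shap

def explain_customer(model, X, customer_id, out_html="force_plot.html"):
    """Return the default probability of one customer and save its SHAP force plot."""
    row = X.loc[[customer_id]]                 # keep a 2D frame for predict_proba
    score = model.predict_proba(row)[0, 1]     # probability of default

    explainer = shap.TreeExplainer(model)      # output shapes differ by model type
    shap_values = explainer.shap_values(row)
    force = shap.force_plot(explainer.expected_value, shap_values, row)
    shap.save_html(out_html, force)            # open the HTML file to inspect the plot
    return score
```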
### Optional
Implement a dashboard (using Dash) that takes as input the customer id and that returns the score and the required visualizations.
- https://stackoverflow.com/questions/54292226/putting-html-output-from-shap-into-the-dash-output-layout-callback
### Deliverables
```
project
│ README.md
│ environment.yml
└───data
│ │ ...
└───results
│ │
| |───model (free format)
│ │ │ my_own_model.pkl
│ │ │ model_report.txt
│ │
| |feature_engineering
│ │ │ EDA.ipynb
│ │
| |───clients_outputs
| | | client1_correct_train.pdf (free format)
│ │ │ client2_wrong_train.pdf (free format)
│ │ │ client_test.pdf (free format)
│ │
| |───dashboard (optional)
| | | dashboard.py (free format)
│ │ │ ...
|
|───scripts (free format)
│ │ train.py
│ │ predict.py
│ │ preprocess.py
```
- `README.md` introduces the project and shows the username.
- `environment.yml` contains all libraries required to run the code.
- `username.txt` contains the username, the last modified date of the file **has to correspond to the first day of the project**.
- `EDA.ipynb` contains the exploratory data analysis. This file should contain all steps of the data analysis that contributed, or not, to improving the score of the model. It has to be commented so that the reviewer can understand the analysis and run it without any problem.
- `scripts` contains python file(s) that perform(s) the feature engineering, the model's training and prediction on the test set. It could also be one single Jupyter Notebook. It has to be commented to help the reviewers understand the approach and run the code without any bugs.
### Useful resources
- https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f
### Files needed for this project
[File 1](https://assets.01-edu.org/ai-branch/project5/project05-20221024T130417Z-001.zip)
[File 2](https://assets.01-edu.org/ai-branch/project5/project05-20221024T130417Z-002.zip)

87
subjects/ai/credit-scoring/audit/README.md

@ -0,0 +1,87 @@
#### Credit scoring
##### Preliminary
```
project
│ README.md
│ environment.yml
└───data
│ │ ...
└───results
│ │
| |───model (free format)
│ │ │ my_own_model.pkl
│ │ │ model_report.txt
│ │
| |feature_engineering
│ │ │ EDA.ipynb
│ │
| |───clients_outputs
| | | client1_correct_train.pdf (free format)
│ │ │ client2_wrong_train.pdf (free format)
│ │ │ client_test.pdf (free format)
│ │
| |───dashboard (optional)
| | | dashboard.py (free format)
│ │ │ ...
|
|───scripts (free format)
│ │ train.py
│ │ predict.py
│ │ preprocess.py
```
###### Is the structure of the project as below?
###### Does the readme file introduce the project, summarize how to run the code and show the username?
###### Does the environment contain all libraries used and their versions that are necessary to run the code ?
###### Does the `EDA.ipynb` explain in details the exploratory data analysis ?
## Machine learning model
###### Is the model trained only on the training set?
###### Is the AUC on the test set higher than 75%?
###### Do the model's learning curves prove that the model is not overfitting?
###### Has the training been stopped early enough to avoid overfitting?
###### Does the text document `model_report.txt` describe the methodology used to train the machine learning model?
###### Does `predict.py` run without any error and return the following?
```prompt
python predict.py
AUC on test set: 0.76
```
This [article](https://medium.com/thecyphy/home-credit-default-risk-part-2-84b58c1ab9d5) gives a complete example of a good modelling approach:
## Model's interpretability
### Feature importance:
###### Are the importances of all features used by the model computed and shown in a visualisation?
###### Is the mapping between the importance of the features and the features' names correct? You should be careful here to associate the right variables with their feature importance. Sometimes the preprocessing pipeline removes some features, during the feature selection step for instance (see the sketch below).
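One possible way to check that mapping, sketched under the assumption of a scikit-learn `Pipeline` (recent version, with transformers implementing `get_feature_names_out`) whose last step is a tree-based estimator; the helper name is hypothetical:

```python
import pandas as pd

def importance_table(pipeline):
    """Pair feature_importances_ with the names that actually reach the estimator."""
    preprocessing = pipeline[:-1]                   # every step except the estimator
    names = preprocessing.get_feature_names_out()   # names after selection/encoding
    estimator = pipeline[-1]
    return (pd.Series(estimator.feature_importances_, index=names)
              .sort_values(ascending=False))
```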
### Descriptive variables:
##### These are important to understand, for example, the age of the client. The data may be scaled or modified in the preprocessing pipeline, but the data visualised here should be "raw". This part is validated if the visualisations are computed for the 3 clients.
- visualisations that show at least 10 variables describing the client and its loan(s)
- visualisations that show the comparison between this client and other clients.
##### SHAP values on the model are displayed through a summary plot that shows the important features and their impact on the target. This is optional if you have already computed the feature importance.
###### Are the 3 clients selected as expected? 2 clients from the train set (1 on which the model is correct and 1 on which the model is wrong) and 1 client from the test set.
##### SHAP values on predictions are computed for the 3 clients. The force plot shows which variables contribute the most to the score. **Check that the score outputted by the force plot corresponds to the one outputted by the model.**

BIN
subjects/ai/credit-scoring/data_description.png

Size: 358 KiB

46
subjects/ai/credit-scoring/readme_data.md

@ -0,0 +1,46 @@
# Credit scoring data description
This file describes the available data for the project.
![alt data description](data_description.png "Credit scoring data description")
## application_{train|test}.csv
This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
Static data for all applications. One row represents one loan in our data sample.
## bureau.csv
All client's previous credits provided by other financial institutions that were reported to Credit Bureau (for clients who have a loan in our sample).
For every loan in our sample, there are as many rows as number of credits the client had in Credit Bureau before the application date.
## bureau_balance.csv
Monthly balances of previous credits in Credit Bureau.
This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e. the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows.
## POS_CASH_balance.csv
Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit.
This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows.
## credit_card_balance.csv
Monthly balance snapshots of previous credit cards that the applicant has with Home Credit.
This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows.
## previous_application.csv
All previous applications for Home Credit loans of clients who have loans in our sample.
There is one row for each previous application related to loans in our data sample.
## installments_payments.csv
Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample.
There is a) one row for every payment that was made plus b) one row for each missed payment.
One row is equivalent to one payment of one installment OR one installment corresponding to one payment of one previous Home Credit credit related to loans in our sample.
## HomeCredit_columns_description.csv
This file contains descriptions for the columns in the various data files.
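For illustration only (not a deliverable), a sketch of how `bureau.csv` can be aggregated per client and joined onto `application_train.csv`; the file paths, the aggregated column names and the aggregation choices are assumptions:

```python
import pandas as pd

app = pd.read_csv("data/application_train.csv")
bureau = pd.read_csv("data/bureau.csv")

# bureau has one row per previous credit -> aggregate to one row per client first.
bureau_agg = bureau.groupby("SK_ID_CURR").agg(
    n_previous_credits=("SK_ID_BUREAU", "count"),
    total_credit_sum=("AMT_CREDIT_SUM", "sum"),
)

# Left join keeps every application, with NaN where a client has no bureau history.
train = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
```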

302
subjects/ai/data-wrangling-with-pandas/README.md

@ -0,0 +1,302 @@
# Data wrangling with Pandas
Data wrangling is one of the crucial tasks in data science and analysis which includes operations like:
- Data Sorting: To rearrange values in ascending or descending order.
- Data Filtration: To create a subset of available data.
- Data Reduction: To eliminate or replace unwanted values.
- Data Access: To read or write data files.
- Data Processing: To perform aggregation, statistical, and similar operations on specific values.
As explained before, Pandas is an open-source library specifically developed for data science and analysis. It is built upon the NumPy package (to handle numeric data in tabular form) and has built-in data structures to ease the process of data manipulation, aka data munging/wrangling.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Concatenate
- Exercise 2: Merge
- Exercise 3: Merge MultiIndex
- Exercise 4: Groupby Apply
- Exercise 5: Groupby Agg
- Exercise 6: Unstack
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
- Tabulate
_Version of Pandas I used to do the exercises: 1.0.1_.
I suggest using the most recent one.
### Resources
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` ,`tabulate` and `jupyter`.
---
---
# Exercise 1: Concatenate
The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for the Series.
Here are the two DataFrames to concatenate:
```python
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]],
columns=['letter', 'number'])
```
1. Concatenate these two DataFrames along the index axis and reset the index. The index of the output should be `RangeIndex(start=0, stop=4, step=1)`. **Do not change the index manually**.
---
---
# Exercise 2: Merge
The goal of this exercise is to learn to merge DataFrames
The logic of merging DataFrames in Pandas is quite similar to the one used in SQL.
Here are the two DataFrames to merge:
```python
#df1
df1_dict = {
'id': ['1', '2', '3', '4', '5'],
'Feature1': ['A', 'C', 'E', 'G', 'I'],
'Feature2': ['B', 'D', 'F', 'H', 'J']}
df1 = pd.DataFrame(df1_dict, columns = ['id', 'Feature1', 'Feature2'])
#df2
df2_dict = {
'id': ['1', '2', '6', '7', '8'],
'Feature1': ['K', 'M', 'O', 'Q', 'S'],
'Feature2': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
```
1. Merge the two DataFrames to get this output:
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
| --: | --: | :--------- | :--------- | :--------- | :--------- |
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
2. Merge the two DataFrames to get this output:
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
| --: | --: | :----------- | :----------- | :----------- | :----------- |
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
| 2 | 3 | E | F | nan | nan |
| 3 | 4 | G | H | nan | nan |
| 4 | 5 | I | J | nan | nan |
| 5 | 6 | nan | nan | O | P |
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |
---
---
# Exercise 3: Merge MultiIndex
The goal of this exercise is to learn to merge DataFrames with MultiIndex.
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reason, the Data Engineer lost the last 15 days of alternative data.
1. Using `market_data` as the reference, merge `alternative_data` on `market_data`
```python
#generate days
all_dates = pd.date_range('2021-01-01', '2021-12-15')
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
#generate tickers
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']
#create indexes
index_alt = pd.MultiIndex.from_product([all_dates, tickers], names=['Date', 'Ticker'])
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])
# create DFs
market_data = pd.DataFrame(index=index,
data=np.random.randn(len(index), 3),
columns=['Open','Close','Close_Adjusted'])
alternative_data = pd.DataFrame(index=index_alt,
data=np.random.randn(len(index_alt), 2),
columns=['Twitter','Reddit'])
```
`reset_index` is not allowed for this question
2. Fill missing values with 0
- https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
---
---
# Exercise 4: Groupby Apply
The goal of this exercise is to learn to group the data and apply a function on the groups.
The use case we will work on is correcting outliers by winsorizing the data.
1. Create a function that uses `pandas.DataFrame.clip` and that replaces extreme values by a given percentile. The values that are greater than the upper percentile (80%) are replaced by the 80th percentile value. The values that are smaller than the lower percentile (20%) are replaced by the 20th percentile value. This process, which corrects outliers, is called **winsorizing**.
I recommend using NumPy to compute the percentiles to make sure we use the same default parameters.
```python
def winsorize(df, quantiles):
"""
df: pd.DataFrame
quantiles: list
ex: [0.05, 0.95]
"""
#TODO
return
```
Here is what the function should output:
```python
df = pd.DataFrame(range(1,11), columns=['sequence'])
print(winsorize(df, [0.20, 0.80]).to_markdown())
```
| | sequence |
|---:|-----------:|
| 0 | 2.8 |
| 1 | 2.8 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 8.2 |
| 9 | 8.2 |
2. Now we consider that each value belongs to a group. The goal is to apply **winsorizing to each group**. In this question we use common winsorizing percentiles: `[0.05, 0.95]`. Here is the new data set:
```python
groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4])
df = pd.DataFrame(data= zip(groups,
range(1,51)),
columns=["group", "sequence"])
```
The expected output (first rows) is:
| | sequence |
| --: | -------: |
| 0 | 1.45 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 9.55 |
| 10 | 11.45 |
---
---
# Exercise 5: Groupby Agg
The goal of this exercise is to learn to compute different types of aggregations on the groups. This small DataFrame contains products and prices.
| | value | product |
| --: | -----: | :----------- |
| 0 | 20.45 | table |
| 1 | 22.89 | chair |
| 2 | 32.12 | chair |
| 3 | 111.22 | mobile phone |
| 4 | 33.22 | table |
| 5 | 100 | mobile phone |
| 6 | 99.99 | table |
1. Compute the min, max and mean price for each product in one single line of code. The expected output is:
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
| :----------- | ---------------: | ---------------: | ----------------: |
| chair | 22.89 | 32.12 | 27.505 |
| mobile phone | 100 | 111.22 | 105.61 |
| table | 20.45 | 99.99 | 51.22 |
Note: The columns don't have to be MultiIndex
---
---
# Exercise 6: Unstack
The goal of this exercise is to learn to unstack a MultiIndex
Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, ...
```python
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
#generate tickers
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']
#create indexes
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])
# create DFs
market_data = pd.DataFrame(index=index,
data=np.random.randn(len(index), 1),
columns=['Prediction'])
```
1. Unstack the DataFrame.
The first 3 rows of the DataFrame should look like this:
| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') |
| :------------------ | ---------------------: | ---------------------: | --------------------: | -------------------: | -------------------: |
| 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 |
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
2. Plot the 5 time series in the same plot using Pandas built-in visualization functions, with a title.

168
subjects/ai/data-wrangling-with-pandas/audit/README.md

@ -0,0 +1,168 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy` and `import pandas` run without any error?
---
---
#### Exercise 1: Concatenate
##### This question is validated if the outputted DataFrame is:
| | letter | number |
|---:|:---------|---------:|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 1 |
| 3 | d | 2 |
---
---
#### Exercise 2: Merge
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the output is:
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
|---:|-----:|:-------------|:-------------|:-------------|:-------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
##### The question 2 is validated if the output is:
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
|---:|-----:|:---------------|:---------------|:---------------|:---------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
| 2 | 3 | E | F | nan | nan |
| 3 | 4 | G | H | nan | nan |
| 4 | 5 | I | J | nan | nan |
| 5 | 6 | nan | nan | O | P |
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |
Note: Check that the suffixes are set using the `suffixes` parameter rather than by manually renaming the columns, as in the sketch below.
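For reference, a sketch of the kind of call that produces these suffixes through the `suffixes` parameter, reusing the DataFrames defined in the subject:

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '3', '4', '5'],
                    'Feature1': ['A', 'C', 'E', 'G', 'I'],
                    'Feature2': ['B', 'D', 'F', 'H', 'J']})
df2 = pd.DataFrame({'id': ['1', '2', '6', '7', '8'],
                    'Feature1': ['K', 'M', 'O', 'Q', 'S'],
                    'Feature2': ['L', 'N', 'P', 'R', 'T']})

# Outer merge keeps all ids; the suffixes parameter names the overlapping columns.
merged = df1.merge(df2, on='id', how='outer', suffixes=('_df1', '_df2'))
print(merged.to_markdown())
```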
---
---
#### Exercise 3: Merge MultiIndex
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a table as below. One of the answers that returns the correct DataFrame is `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`
| | Open | Close | Close_Adjusted | Twitter | Reddit |
| :--------------------------------------------------- | --------: | -------: | -------------: | ----------: | --------: |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AAPL') | 0.0991792 | -0.31603 | 0.634787 | -0.00159041 | 1.06053 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'FB') | -0.123753 | 1.00269 | 0.713264 | 0.0142127 | -0.487028 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'GE') | -1.37775 | -1.01504 | 1.2858 | 0.109835 | 0.04273 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 |
##### The question 2 is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
---
---
#### Exercise 4: Groupby Apply
##### The exercise is validated if all questions of the exercise are validated and if no for loop has been used. The goal is to use `groupby` and `apply`.
##### The question 1 is validated if the output is:
```python
df = pd.DataFrame(range(1,11), columns=['sequence'])
print(winsorize(df, [0.20, 0.80]).to_markdown())
```
| | sequence |
|---:|-----------:|
| 0 | 2.8 |
| 1 | 2.8 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 8.2 |
| 9 | 8.2 |
##### The question 2 is validated if the output is a Pandas Series or DataFrame with the first 11 rows equal to the output below. The code below gives a solution.
| | sequence |
|---:|-----------:|
| 0 | 1.45 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 9.55 |
| 10 | 11.45 |
```python
def winsorize(df_series, quantiles):
"""
df: pd.DataFrame or pd.Series
quantiles: list [0.05, 0.95]
"""
min_value = np.quantile(df_series, quantiles[0])
max_value = np.quantile(df_series, quantiles[1])
return df_series.clip(lower = min_value, upper = max_value)
df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95])
```
- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e
---
---
#### Exercise 5: Groupby Agg
##### The question is validated if the output is as below. The columns don't have to be MultiIndex. A solution could be `df.groupby('product').agg({'value':['min','max','mean']})`
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
| :----------- | ---------------: | ---------------: | ----------------: |
| chair | 22.89 | 32.12 | 27.505 |
| mobile phone | 100 | 111.22 | 105.61 |
| table | 20.45 | 99.99 | 51.22 |
---
---
#### Exercise 6: Unstack
##### The question 1 is validated if the output is similar (as the values are generated randomly, the audit obviously doesn't require matching the values below) to what `unstacked_df.head()` returns:
| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') |
|:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
| 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 |
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
##### The question 2 is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.
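For reference, a sketch that produces such a DataFrame from the subject's `market_data` (the values will differ since they are generated randomly):

```python
import numpy as np
import pandas as pd

# Rebuild the MultiIndexed DataFrame from the subject.
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])
market_data = pd.DataFrame(index=index,
                           data=np.random.randn(len(index), 1),
                           columns=['Prediction'])

# Question 1: move the Ticker level into the columns.
unstacked_df = market_data.unstack(level='Ticker')
# Question 2: one line per ticker; any title is accepted.
unstacked_df.plot(title='Stocks 2021')
```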

152
subjects/ai/emotions-detector/README.md

@ -0,0 +1,152 @@
# Emotions detection with Deep Learning
Cameras are everywhere. Videos and images have become one of the most interesting data sets for artificial intelligence.
Image processing is quite a broad research area, not just filtering, compression, and enhancement. Besides, we are even interested in the question, "what is in images?", i.e., content analysis of visual inputs, which is part of the main task of computer vision. The study of computer vision could make possible such tasks as 3D reconstruction of scenes, motion capture, and object recognition, which are crucial for even higher-level intelligence such as
image and video understanding, and motion understanding.
For this 2-month project we will focus on two tasks:
- emotion classification
- face tracking
With computing power increasing exponentially, the computer vision field has been developing rapidly. This is a key element because this computing power makes it easier to use a type of neural network that is very powerful on images: CNNs (Convolutional Neural Networks). Before CNNs were democratized, the algorithms relied heavily on human analysis to extract features, which was obviously time-consuming and not reliable. If you're interested in the "old school" methodology, this article explains it: towardsdatascience.com/classifying-facial-emotions-via-machine-learning-5aac111932d3.
The history behind this field is fascinating! Here is a short summary of its history: https://kapernikov.com/basic-introduction-to-computer-vision/
### Project goal and suggested timeline
The goal of the project is to implement a **system that detects the emotion on a face from a webcam video stream**. To achieve this exciting task you'll have to understand how to:
- deal with images in Python
- detect a face in an image
- train a CNN to detect the emotion on a face
That is why I suggest starting the project with a preliminary step. The goal of this step is to understand how CNNs work and how to classify images. This preliminary step should take approximately **two weeks**.
Then comes the emotion detection in a webcam video stream step, which will last until the end of the project!
The two steps are detailed below.
### Preliminary:
- Take this lesson. This course is a reference for many reasons and one of them is the creator: **Andrew Ng**. He explains the basics of CNNs but also some more advanced topics such as transfer learning, siamese networks, etc. I suggest focusing on Weeks 1 and 2 and spending less time on Weeks 3 and 4. Don't worry, the time scoping of such MOOCs is conservative ;-). Here is the link: https://www.coursera.org/learn/convolutional-neural-networks . You can attend the lessons for free!
- Participate in this challenge: https://www.kaggle.com/c/digit-recognizer/code . The MNIST dataset is a reference in computer vision. Researchers use it as a benchmark to compare their models. Start first with a logistic regression to understand how to handle images in Python. Then train your first CNN on this data set.
### Face emotions classification
Emotion detection is one of the most researched topics in the modern-day machine learning arena. The ability to accurately detect and identify an emotion opens up numerous doors for Advanced Human Computer Interaction. The aim of this project is to detect up to seven distinct facial emotions in real time. This project runs on top of a Convolutional Neural Network (CNN) that is built with the help of Keras whose backend is TensorFlow in Python. The facial emotions that can be detected and classified by this system are Happy, Sad, Angry, Surprise, Fear, Disgust and Neutral.
Your goal is to implement a program that takes as input a video stream that contains a person's face and that predicts the emotion of the person.
**Step 1**: **Fit the emotion classifier**
- Train a CNN on the dataset `train.csv`. Here is an example of an architecture you can implement: https://www.quora.com/What-is-the-VGG-neural-network . **The CNN has to perform more than 70% on the test set**. You will see that CNNs take a lot of time to train. You don't want to overfit the neural network. I strongly suggest using early stopping and callbacks, and monitoring the training with TensorBoard (see the training sketch after this step).
You have to save the trained model in `my_own_model.pkl` and explain the chosen architecture in `my_own_model_architecture.txt`. Use `model.summary()` to print the architecture. It is also expected that you explain the iterations and how you ended up choosing your final architecture. Save a screenshot of TensorBoard while the model is training in `tensorboard.png`, and save a plot with the learning curves showing the model training and stopping BEFORE the model starts overfitting in `learning_curves.png`.
- Optional: Use a pre-trained CNN to improve the accuracy. You will find some huge CNN architectures that perform well. The issue is that it is expensive to train them from scratch: you would need a lot of GPUs, memory and time. **Pre-trained CNNs** partially solve this issue because they are already trained on a dataset and perform well on some use cases. However, building a CNN from scratch is required; as mentioned, this step is optional and doesn't replace the first one. Similarly, save the model and explain the chosen architecture.
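A minimal training sketch showing the early stopping and TensorBoard callbacks mentioned above. The architecture is a deliberately tiny placeholder and the data loading from `train.csv` is left out, so every name here is an assumption rather than the expected solution:

```python
import tensorflow as tf

# Tiny illustrative CNN for 48x48 grayscale faces and 7 emotion classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(48, 48, 1)),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(7, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

callbacks = [
    # Stop before overfitting and keep the best weights for learning_curves.png.
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    # Monitor the run; screenshot the dashboard for tensorboard.png.
    tf.keras.callbacks.TensorBoard(log_dir='logs/emotions'),
]

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, batch_size=64, callbacks=callbacks)
```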
**Step 2**: **Classify emotions from a video stream**
- Use the video stream outputted by your computer's webcam and preprocess it to make it compatible with the CNN you trained. One of the preprocessing steps is face detection. As you may have seen, the training samples are images centered on a face. I suggest using a pre-trained model to detect faces: use OpenCV for the image processing tasks where we identify a face from a live webcam feed, which is then processed and fed into the trained neural network for emotion detection. The preprocessing pipeline will be corrected with a functional test in `preprocessing_test`:
- **Input**: Video stream of 20 sec with a face on it
- **Output**: 20 (or 21) images cropped and centered on the face with 48 x 48 grayscale pixels
- Predict at least one emotion per second from the video stream. The minimum requirement is printing in the prompt the predicted emotion with its associated probability. If there's any problem related to the webcam, use a recorded video stream as input.
For that step, I suggest again using **OpenCV** as much as possible. This link shows how to work with a video stream with OpenCV. The OpenCV documentation may become deprecated in the future; however, OpenCV will always provide tools to work with video streams, so search the internet for the OpenCV documentation, and more specifically for "opencv video streams": https://docs.opencv.org/4.x/dd/d43/tutorial_py_video_display.html . A minimal face-detection sketch is given below.
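A minimal face-detection sketch, assuming OpenCV's bundled Haar cascade and a single frame grabbed from the default webcam; the output file name and detection parameters are placeholders:

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

capture = cv2.VideoCapture(0)       # 0 = default webcam; pass a video file path as fallback
ok, frame = capture.read()
capture.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        face_48 = cv2.resize(gray[y:y + h, x:x + w], (48, 48))   # crop, center, 48x48 grayscale
        cv2.imwrite("image0.png", face_48)
```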
- Optional: **(very cool)** Hack the CNN. Take a picture for which the prediction of your CNN is **Happy**. Now, hack the CNN: using the same image **SLIGHTLY** modified make the CNN predict **Sad**. https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196
### Deliverable
```
project
│ README.md
│ environment.yml
└───data
│ │ train.csv
│ │ test.csv
│ │ xxx.csv
└───results
│ │
| |───model (free format)
│ │ │ my_own_model.pkl
│ │ │ my_own_model_architecture.txt
│ │ │ tensorboard.png
│ │ │ learning_curves.png
│ │ │ pre_trained_model.pkl (optional)
│ │ │ pre_trained_model_architecture.txt (optional)
│ │
| |───hack_cnn (free format)
│ │ │ hacked_image.png (optional)
│ │ │ input_image.png
│ │
| |───preprocessing_test
| | | input_video.mp4 (free format)
│ │ │ image0.png (free format)
│ │ │ image1.png
│ │ │ imagen.png
│ │ │ image20.png
|
|───scripts
│ │ train.py
│ │ predict.py
│ │ preprocess.py
│ │ predict_live_stream.py
│ │ hack_the_cnn.py
```
- Run **predict.py** expected output:
```prompt
python predict.py
Accuracy on test set: 72%
```
- Run **predict_live_stream.py** expected output:
```prompt
python predict_live_stream.py
Reading video stream ...
Preprocessing ...
11:11:11s : Happy , 73%
Preprocessing ...
11:11:12s : Happy , 93%
Preprocessing ...
11:11:13s : Surprise , 71%
Preprocessing ...
11:11:14s : Neutral , 82%
...
Preprocessing ...
11:13:29s : Happy , 63%
```
### Useful resources:
- https://machinelearningmastery.com/what-is-computer-vision/
- Use a pre-trained CNN: https://arxiv.org/pdf/1812.06387.pdf
- Hack the CNN https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196
- http://ice.dlut.edu.cn/valse2018/ppt/WeihongDeng_VALSE2018.pdf
- https://arxiv.org/pdf/1812.06387.pdf
### Files needed for this project
[link](https://assets.01-edu.org/ai-branch/project3/challenges-in-representation-learning-facial-expression-recognition-challenge.zip)

131
subjects/ai/emotions-detector/audit/README.md

@ -0,0 +1,131 @@
##### Computer vision
##### Preliminary
```
project
│ README.md
│ environment.yml
└───data
│ │ train.csv
│ │ test.csv
│ │ xxx.csv
└───results
│ │
| |───model (free format)
│ │ │ my_own_model.pkl
│ │ │ my_own_model_architecture.txt
│ │ │ tensorboard.png
│ │ │ learning_curves.png
│ │ │ pre_trained_model.pkl (optional)
│ │ │ pre_trained_model_architecture.txt (optional)
│ │
| |───hack_cnn (free format)
│ │ │ hacked_image.png (optional)
│ │ │ input_image.png
│ │
| |───preprocessing_test
| | | input_video.mp4 (free format)
│ │ │ image0.png (free format)
│ │ │ image1.png
│ │ │ imagen.png
│ │ │ image20.png
|
|───scripts
│ │ train.py
│ │ predict.py
│ │ preprocess.py
│ │ predict_live_stream.py
│ │ hack_the_cnn.py
```
###### Is the structure of the project as below?
###### Does the readme file summarize how to run the code and explain the global approach?
###### Does the environment contain all libraries used and their versions that are necessary to run the code ?
###### Do the text files explain the chosen architectures ?
##### CNN emotion classifier
###### Is the model trained only on the training set?
###### Is the accuracy on the test set higher than 70%?
###### Do the learning curves prove that the model is not overfitting?
###### Has the training been stopped early enough to avoid overfitting?
###### Does the screenshot show the usage of TensorBoard to monitor the training?
###### Does the text document explain why the architecture was chosen and what the previous iterations were?
###### Does the following command `python predict.py` run without any error and return an accuracy greater than 70%?
```prompt
python predict.py
Accuracy on test set: 72%
```
##### Face detection on the video stream
###### Does the preprocessing pipeline take as input the webcam video stream of minimum 20 sec and save in a separate folder at least 20 preprocessed\* images ?
###### Do all images contain a face ?
###### Are all images reshaped and centered on the face ?
###### Is the algorithm that detects the face imported via cv2 ?
###### Is the image converted to a 48 x 48 grayscale pixels' image?
###### If there's an issue related to the webcam, does the code take a recorded video stream as input?
###### Does the following command `predict_live_stream.py` run without any error and return the following ?
```prompt
python predict_live_stream.py
Reading video stream ...
Preprocessing ...
11:11:11s : Happy , 73%
Preprocessing ...
11:11:12s : Happy , 93%
Preprocessing ...
11:11:13s : Surprise , 71%
Preprocessing ...
11:11:14s : Neutral , 82%
...
Preprocessing ...
11:13:29s : Happy , 63%
```
##### Hack the CNN - guidelines:
The neural network trains by updating its weights given the training error. If an image is misclassified, the neural network changes its weights to classify it correctly. The trick is to keep the neural network's weights unchanged and to modify the input pixels in order to force the neural network to predict the wanted class (a minimal sketch of this idea follows the checks below).
This part is validated if:
##### Choose an image from the database that gives more than 90% probability of `Happy`
###### Does the neural network modify the input pixels to predict Sad?
###### Can you easily recognize the chosen image? The modified image is only SLIGHTLY changed, which means that you can very easily recognize the original image.
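A sketch of the idea, under the assumptions of a Keras `model`, a preprocessed `image` of shape `(1, 48, 48, 1)` predicted as Happy, and a hypothetical `sad_index` giving the position of Sad in the output vector; it is one possible approach, not the required one:

```python
import tensorflow as tf

def hack_image(model, image, sad_index, steps=50, epsilon=0.001):
    """Keep the weights frozen; nudge the pixels to raise the probability of Sad."""
    hacked = tf.Variable(tf.identity(image), dtype=tf.float32)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            prediction = model(hacked, training=False)
            target = prediction[0, sad_index]            # probability of Sad: push it up
        gradient = tape.gradient(target, hacked)
        hacked.assign_add(epsilon * tf.sign(gradient))   # tiny, barely visible change
        hacked.assign(tf.clip_by_value(hacked, 0.0, 1.0))
    return hacked.numpy()
```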
Here are three resources that detail similar approaches:
- https://github.com/XC-Li/Facial_Expression_Recognition/tree/master/Code/RAFDB
- https://github.com/karansjc1/emotion-detection/tree/master/with%20flask
- https://www.kaggle.com/drbeanesp21/aliaj-final-facial-expression-recognition (simplified)

105
subjects/ai/forest-cover-type-prediction/README.md

@ -0,0 +1,105 @@
# Forest Cover Type Prediction
The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features and train a machine learning model on the cartographic data to make it as accurate as possible.
### Data
The input files are `train.csv`, `test.csv` and `covtype.info`:
- `train.csv`
- `test.csv`
- `covtype.info`
The train data set is used to **analyse the data and calibrate the models**. The goal is to get the accuracy as high as possible on the test set. The test set will only be made available at the end of the last day, to prevent overfitting of the test set.
The data is described in `covtype.info`.
### Structure
The structure of the project is:
```console
project
│ README.md
│ environment.yml
└───data
│ │ train.csv
│ | test.csv (not available first day)
| | covtype.info
└───notebook
│ │ EDA.ipynb
|
|───scripts
| │ preprocessing_feature_engineering.py
| │ model_selection.py
│ | predict.py
└───results
│ plots
│ test_predictions.csv
│ best_model.pkl
```
### 1. EDA and feature engineering
- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook will not be evaluated.
- *Hint: Examples of interesting features (a small sketch follows this list)*
- `Distance to hydrology = sqrt((Horizontal_Distance_To_Hydrology)^2 + (Vertical_Distance_To_Hydrology)^2)`
- `Horizontal_Distance_To_Fire_Points - Horizontal_Distance_To_Roadways`
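A sketch of these two hinted features, assuming `train.csv` uses the column names quoted above; the new feature names are only suggestions:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("data/train.csv")

# Euclidean distance to hydrology.
df["Distance_To_Hydrology"] = np.sqrt(
    df["Horizontal_Distance_To_Hydrology"] ** 2
    + df["Vertical_Distance_To_Hydrology"] ** 2)

# Difference between fire-point and roadway distances.
df["Fire_Minus_Roadways"] = (df["Horizontal_Distance_To_Fire_Points"]
                             - df["Horizontal_Distance_To_Roadways"])
```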
### 2. Model Selection
The model selection approach is a key step because it should return the best model and guarantee that the results are reproducible on the final test set. The goal of this step is to make sure that the results on the test set are not due to test set overfitting. It implies splitting the data set as shown below:
```console
DATA
└───TRAIN FILE (0)
│ └───── Train (1)
│ | Fold0:
| | Train
| | Validation
| | Fold1:
| | Train
| | Validation
... ... ...
| |
| └───── Test (1)
└─── TEST FILE (0) (available last day)
```
**Rules:**
- Split train test
- Cross validation: at least 5 folds
- Grid search on at least 5 different models:
- Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression. *Remember that for some models scaling the data is important and for others it doesn't matter.*
- Train accuracy score < **0.98**. Train set (0). Write the result in the `README.md`
- Test (last day) accuracy > **0.65**. Test set (0). Write the result in the `README.md`
- Display the confusion matrix for the best model in a DataFrame. Specify the index and column names (True label and Predicted label)
- Plot the learning curve for the best model
- Save the trained model as a [pickle](https://www.datacamp.com/community/tutorials/pickle-python-tutorial) file
> Advice: As the grid search takes time, I suggest preparing and testing the code. Once you are confident it works, run the grid search overnight and analyse the results. A minimal grid-search sketch is given at the end of this section.
**Hint**: The confusion matrix shows the misclassifications class per class. Try to detect whether the model badly misclassifies one class as another. Then, do some research on the internet on the two forest cover types, find the differences and create some new features that underline these differences. More generally, the methodology of model learning is a cycle with several iterations. More details [here](https://serokell.io/blog/machine-learning-testing)
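A minimal sketch of the "1 grid search on a pipeline" option, assuming the train file and a `Cover_Type` target column; the model, grid and parameters are placeholders to adapt:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/train.csv")                     # TRAIN FILE (0)
y = df["Cover_Type"]                                   # assumed target column name
X = df.drop(columns=["Cover_Type"])                    # target removed from X

# Train (1) / Test (1) split, then 5-fold cross-validated grid search on Train (1).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("model", RandomForestClassifier(random_state=0))])
grid = GridSearchCV(pipeline,
                    param_grid={"model__n_estimators": [100, 300],
                                "model__max_depth": [10, 20, None]},
                    cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_)
print("Accuracy on Test (1):", grid.score(X_test, y_test))
```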
### 3. Predict (last day)
Once you have selected the best model and you are confident it will perform well on new data, you're ready to predict on the test set:
- Load the trained model
- Predict on the test set and compute the accuracy
- Save the predictions in a csv file
- Add your score on the `README.md`
### Files needed for this project
[link](https://assets.01-edu.org/ai-branch/piscine-ai/raid02/raid02-20221024T133335Z-001.zip)

110
subjects/ai/forest-cover-type-prediction/audit/README.md

@ -0,0 +1,110 @@
# Forest Cover Type Prediction
The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features and to train a machine learning model on the cartographic data to make it as accurate as possible.
### Preliminary
###### Is the structure of the project as below?
The expected structure of the project is:
```
project
│ README.md
│ environment.yml
└───data
│ │ train.csv
│ | test.csv (not available first day)
| | covtype.info
└───notebook
│ │ EDA.ipynb
|
|───scripts
| │ preprocessing_feature_engineering.py
| │ model_selection.py
│ | predict.py
└───results
│ confusion_matrix_heatmap.png
│ learning_curve_best_model.png
│ test_predictions.csv
│ best_model.pkl
```
###### Does the readme file contain a description of the project, explain how to run the code from an empty environment, give a summary of the implementation of each python file, especially details on the feature engineering which is a key step ?
###### Does the environment contain all libraries used and their versions that are necessary to run the code ?
### 1. Preprocessing and features engineering:
## 2. Model selection and predict
### Data splitting
###### Is the data splitting (cross-validation) structured as follows?
```
DATA
└───TRAIN FILE (0)
│ └───── Train (1):
│ | Fold0:
| | Train
| | Validation
| | Fold1:
| | Train
| | Validation
... ... ...
| |
| └───── Test (1)
└─── TEST FILE (0)(available last day)
```
##### The train set (0) is divided into a train set (1) and a test set (1). The ratio is less than 33%.
##### The cross validation splits the train set (1) into at least 5 folds. If the cross validation is stratified that's a good point, but it is not a requirement.
### Gridsearch
##### It contains at least these 5 different models: Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression.
There are many options:
- 5 grid searches on 1 model
- 1 grid search on 5 models
- 1 grid search on a pipeline that contains the preprocessing
- 5 grid searches on a pipeline that contains the preprocessing
### Training
###### Is the **target removed from the X** matrix?
### Results
##### Run predict.py on the test set, check that: Test (last day) accuracy > **0.65**.
##### Train accuracy score < **0.98**.
It can be checked on the learning curve. If you are not sure, load the model, load the training set (0), score on the training set (0).
##### The confusion matrix is represented as a DataFrame. Example:
![alt text][confusion_matrix]
[confusion_matrix]: ../images/w2_weekend_confusion_matrix.png "Confusion matrix "
##### The learning curve for the best model is plotted. Example:
![alt text][logo_learning_curve]
[logo_learning_curve]: ../images/w2_weekend_learning_curve.png "Learning curve "
Note: The green line on the plot shows the accuracy on the validation set not on the test set (1) and not on the test set (0).
###### Is the trained model saved as a pickle file ?

BIN
subjects/ai/forest-cover-type-prediction/images/w2_weekend_confusion_matrix.png

Size: 53 KiB

BIN
subjects/ai/forest-cover-type-prediction/images/w2_weekend_learning_curve.png

Size: 79 KiB

115
subjects/ai/kaggle-titanic/README.md

@ -0,0 +1,115 @@
# Your first Kaggle: Titanic
### Introduction
The goal of this **1 week** project is to get the highest possible score on a Data Science competition. More precisely you will have to predict who survived the Titanic crash.
![alt text][titanic]
[titanic]: titanic.jpg "Titanic"
### Kaggle
Kaggle is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. It’s a crowd-sourced platform to attract, nurture, train and challenge data scientists from all around the world to solve data science, machine learning and predictive analytics problems.
### Titanic - Machine Learning from Disaster
One of the first Kaggle competitions I did was: Titanic - Machine Learning from Disaster. This is a not-to-be-missed Kaggle competition.
You can see more [here](https://www.kaggle.com/c/titanic)
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, you have to build a predictive model that answers the question: **“what sorts of people were more likely to survive?”** using passenger data (ie name, age, gender, socio-economic class, etc). **You will have to submit your prediction on Kaggle**.
### Preliminary
The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest this [resource](https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18) that gives detailed explanations.
- Create a username following this structure: `username_01EDU_location_MM_YYYY`. Submit the profile description and push it to GitHub on the first day of the week. Do not touch this file anymore.
- It is possible to have different personal accounts merged in a team for one single competition.
### Deliverables
```console
project
│ README.md
│ environment.yml
│ username.txt
└───data
│ │ train.csv
│ | test.csv
| | gender_submission.csv
└───notebook
│ │ EDA.ipynb
|
|───scripts
```
- `README.md` introduces the project, shows the username, describes the feature engineering and the best score on the **leaderboard**. Note the score on the test set using the exact same pipeline that led to the best score on the leaderboard.
- `environment.yml` contains all libraries required to run the code.
- `username.txt` contains the username, the last modified date of the file **has to correspond to the first day of the project**.
- `EDA.ipynb` contains the exploratory data analysis. This file should contain all steps of data analysis that contributed or not to improve the accuracy. It has to be commented so that the reviewer can understand the analysis and run it without any problem.
- `scripts` contains python file(s) that perform(s) the feature engineering, the model's training and prediction on the test set. It could also be one single Jupyter Notebook. It has to be commented to help the reviewers understand the approach and run the code without any bugs.
- **Submit your predictions on the Kaggle's competition platform**. Check your ranking and score in the leaderboard.
### Scores
In order to validate the project you will have to score at least **79% accuracy on the Leaderboard**:
- 79% accuracy is the minimum score to validate the project.
Scores indication:
- 79%: difficult, minimum required
- 81%: very difficult, smart feature engineering needed
- More than 83%: excellent, corresponds to the top 2% on Kaggle
- More than 85%: cheating
#### Cheating
It is impossible to get 100%. Who would have predicted that Rose wouldn't let [Jack on the door](https://www.insider.com/jack-and-rose-werent-on-a-door-in-titanic-2019-7)?
All people with 100% accuracy on the leaderboard cheated; there's no point in comparing with them or in cheating. The Kaggle community considers scores above 85% as almost certainly cheated submissions, since there is an element of luck involved in survival.
**You can't use external data sets other than the ones provided in that competition.**
### The key points
- **Feature engineering**:
Put yourself in the shoes of an investigator trying to understand what exactly happened on that boat during the crash. Do not hesitate to watch the movie to try to find as many insights as possible. Without smart feature engineering, there's no way to validate the project ;-)
- The leaderboard evaluates on test data for which you don't have the labels. It means that there's no point in overfitting the train set. Check the overfitting on the train set by dividing the data and by cross-validating the accuracy.
### Advice
Don't try to build the perfect model the first day. Iterate a lot and test your assumptions:
Iteration 1:
- Predict that all passengers die (see the baseline sketch after this list)
Iteration 2:
- Fit a logistic regression with basic feature engineering
Iteration 3:
- Perform an EDA. Make assumptions and check them. Example: what if first-class passengers survived more? Check the assumption through EDA and create relevant features to help the model capture the information.
- Run a gridsearch
Iteration 4:
- Good luck !
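For reference, iteration 1 could be as simple as the sketch below, assuming the standard Kaggle Titanic file and column names (`test.csv`, `PassengerId`, `Survived`):
```
import pandas as pd

test = pd.read_csv("data/test.csv")

# baseline: predict that all passengers die
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": 0,
})
submission.to_csv("submission.csv", index=False)
```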

53
subjects/ai/kaggle-titanic/audit/README.md

@ -0,0 +1,53 @@
#### First Kaggle: Titanic
##### Preliminary
```
project
│ README.md
│ environment.yml
│ username.txt
└───data
│ │ train.csv
│ | test.csv
| | gender_submission.csv
└───notebook
│ │ EDA.ipynb
|
|───scripts
```
###### Is the structure of the project as below?
###### Does the README file give an introduction of the project, show the username, describe the feature engineering and show the best score on the leaderboard?
###### Does the environment contain all libraries used and their versions that are necessary to run the code ?
##### Feature engineering
###### Can the notebook be executed without any error?
###### Does the notebook explain the feature engineering that contributed to improve the accuracy ?
##### Scripts
###### Can you train the best model on the train data with feature engineering without any error ?
###### Can you predict on the test set using the best model without any error ?
###### Is the score you get **on the test set** with the best model close to what is expected?
##### Final score
###### Is the accuracy associated with the username in `username.txt` higher than 79%? The best submission score can be accessed from the user profile.
##### Examples
Here are two very good submissions explained and detailed:
- https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83
- https://www.kaggle.com/sreevishnudamodaran/ultimate-eda-fe-neural-network-model-top-2

BIN
subjects/ai/kaggle-titanic/titanic.jpg


141
subjects/ai/keras-2/README.md

@ -0,0 +1,141 @@
# Keras 2
The goal of this day is to learn to use Keras to build Neural Networks and train them on small data sets. This helps to understand the specifics of networks for classification and regression.
Note:
The audit will provide the code and output because it is not straightforward to reproduce results using Keras. There are many sources of randomness. Even if all the seeds are fixed to a constant, there may be other sources of randomness. https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Regression - Optimize
- Exercise 2: Regression example
- Exercise 3: Multi classification - Softmax
- Exercise 4: Multi classification - Optimize
- Exercise 5: Multi classification example
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
- Keras
_Version of Keras I used to do the exercises: 2.4.3_.
I suggest using the most recent one.
### **Resources**
- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required.
1. Create a virtual environment with a version of Python >= `3.8` and the following libraries: `pandas`, `numpy`, `jupyter` and `keras`.
---
---
# Exercise 1: Regression - Optimize
The goal of this exercise is to learn to set up the optimization for a regression neural network. There's no code to run in that exercise. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network:
```
model = keras.Sequential()
model.add(Dense(8, input_shape=(5,), activation= 'sigmoid'))
model.add(Dense(4, activation= 'sigmoid'))
model.add(Dense(1, activation= 'linear'))
```
As a reminder, the main difference between a regression and classification neural network's architecture is the output layer activation function.
1. Fill this chunk of code to set up the optimization part of the regression neural network:
```
model.compile(
optimizer='adam',
loss='',#TODO1
metrics=[''] #TODO2
)
```
Hint:
- Mean Squared Error (MSE) and Mean Absolute Error (MAE) are common loss functions used for regression problems. Mean Absolute Error is less sensitive to outliers. Different loss functions are used for classification problems. Similarly, evaluation metrics used for regression differ from classification.
https://keras.io/api/metrics/regression_metrics/
---
---
# Exercise 2: Regression example
The goal of this exercise is to learn to train a neural network to perform a regression on a data set.
The data set is the Auto MPG Dataset and the goal is to build a model to predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this, provide the model with a description of many automobiles from that time period. This description includes attributes like cylinders, displacement, horsepower, and weight.
https://www.tensorflow.org/tutorials/keras/regression
1. Preprocess the data set as follows:
- Drop the columns: **model year**, **origin**, **car name**
- Split train test without shuffling the data. Keep 20% for the test set.
- Scale the data using Standard Scaler
2. Train a neural network on the train set and predict on the test set. The neural network should have 2 hidden layers and the loss should be **mean_squared_error**. The expected **mean absolute error** on the test set is maximum 10.
_Hint_: increase the number of epochs.
**Warning**: Do not forget to evaluate the neural network on the **SCALED** test set.
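One possible preprocessing is sketched below for reference only; the handling of the missing `horsepower` values (noted `?` in the csv) is an assumption, and your own implementation may differ:
```
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("auto-mpg.csv")
df = df.drop(columns=["model year", "origin", "car name"])

# horsepower contains "?" for missing values; one option is to drop those rows
df["horsepower"] = pd.to_numeric(df["horsepower"], errors="coerce")
df = df.dropna()

# split without shuffling: first 80% for train, last 20% for test
n_train = int(len(df) * 0.8)
train, test = df.iloc[:n_train], df.iloc[n_train:]

X_train, y_train = train.drop(columns=["mpg"]), train[["mpg"]]
X_test, y_test = test.drop(columns=["mpg"]), test[["mpg"]]

scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)
```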
---
---
# Exercise 3: Multi classification - Softmax
The goal of this exercise is to learn to design a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the multiple class values require specialized handling. A multi-classification neural network uses a **softmax** layer as output layer. The **softmax** activation function is an extension of the sigmoid as it is designed to output the probabilities of belonging to each class in a multi-class problem. This output layer has to contain as many neurons as there are classes in the multi-classification problem. This article explains in detail how it works: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
Let us assume we want to classify images and we know they contain either apples, bears, candies, eggs or dogs (an extension of the example in the link above).
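For intuition, here is a small sketch of what softmax does to a vector of raw scores (the values are made up):
```
import numpy as np

def softmax(z):
    # subtract the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0, 0.5])  # one raw score per class
print(softmax(scores))        # probabilities for the 5 classes
print(softmax(scores).sum())  # 1.0
```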
1. Create a multi-class neural network with the following architecture and return `print(model.summary())`:
- 5 input variables
- hidden layer 1: 16 neurons and sigmoid as activation function
- hidden layer 2: 8 neurons and sigmoid as activation function
- output layer: The number of neurons and the activation function should be adapted to this multi-classification problem.
---
---
# Exercise 4: Multi classification - Optimize
The goal of this exercise is to learn to optimize a multi-classification neural network. As learnt previously, the loss function used in binary classification is the log loss, also called `binary_crossentropy` in Keras. This function is defined for binary classification and can be extended to multi-classification. In Keras, the extended loss that supports multi-classification is `categorical_crossentropy`. There's no code to run in that exercise.
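Note that `categorical_crossentropy` expects one-hot encoded targets; here is a minimal sketch of that encoding (the label values are made up):
```
import numpy as np
from keras.utils import to_categorical

y = np.array([0, 2, 1, 2, 0])  # integer class labels for 3 classes
y_one_hot = to_categorical(y, num_classes=3)
print(y_one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```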
1. Fill the chunk of code below in order to optimize the neural network defined in the previous exercise. Choose the adapted loss, adam as optimizer and the accuracy as metric.
```
model.compile(loss='',#TODO1
optimizer='', #TODO2
metrics=['']) #TODO3
```
---
---

178
subjects/ai/keras-2/audit/README.md

@ -0,0 +1,178 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy`, `import pandas` and `import keras` run without any error?
---
---
#### Exercise 1: Regression - Optimize
##### The question 1 is validated if the chunk of code is:
```
model.compile(
optimizer='adam',
loss='mse',
metrics=['mse']
)
```
All regression metrics or losses used are correct. As explained before, loss functions are chosen for their nice mathematical properties. That is why, most of the time, the loss function used for regression is the MSE or the MAE.
https://keras.io/api/losses/regression_losses/
https://keras.io/api/metrics/regression_metrics/
---
---
#### Exercise 2: Regression example
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the input DataFrames are:
X_train_scaled shape is (313, 5) and the first 5 rows are:
| | cylinders | displacement | horsepower | weight | acceleration |
| --: | --------: | -----------: | ---------: | -------: | -----------: |
| 0 | 1.28377 | 0.884666 | 0.48697 | 0.455708 | -1.19481 |
| 1 | 1.28377 | 1.28127 | 1.36238 | 0.670459 | -1.37737 |
| 2 | 1.28377 | 0.986124 | 0.987205 | 0.378443 | -1.55992 |
| 3 | 1.28377 | 0.856996 | 0.987205 | 0.375034 | -1.19481 |
| 4 | 1.28377 | 0.838549 | 0.737087 | 0.393214 | -1.74247 |
The train target is:
| | mpg |
| --: | --: |
| 0 | 18 |
| 1 | 15 |
| 2 | 18 |
| 3 | 16 |
| 4 | 17 |
X_test_scaled shape is (79, 5) and the first 5 rows are:
| | cylinders | displacement | horsepower | weight | acceleration |
| --: | --------: | -----------: | ---------: | --------: | -----------: |
| 315 | -1.00255 | -0.554185 | -0.5135 | -0.113552 | 1.76253 |
| 316 | 0.140612 | 0.128347 | -0.5135 | 0.31595 | 1.25139 |
| 317 | -1.00255 | -1.05225 | -0.813641 | -1.03959 | 0.192584 |
| 318 | -1.00255 | -0.710983 | -0.5135 | -0.445337 | 0.0830525 |
| 319 | -1.00255 | -0.840111 | -0.888676 | -0.637363 | 0.813262 |
The test target is:
| | mpg |
| --: | ---: |
| 315 | 24.3 |
| 316 | 19.1 |
| 317 | 34.3 |
| 318 | 29.8 |
| 319 | 31.3 |
##### The question 2 is validated if the mean absolute error on the test set is smaller than 10. Here is an architecture that works:
```
# create model
model = Sequential()
model.add(Dense(30, input_dim=5, activation='sigmoid'))
model.add(Dense(30, activation='sigmoid'))
model.add(Dense(1))
# Compile model
model.compile(loss='mean_squared_error',
optimizer='adam', metrics='mean_absolute_error')
```
The output neuron has to be `Dense(1)` - by default the activation function is linear. The loss has to be **mean_squared_error** and the **input_dim** has to be **5**. All variations on the other parameters are accepted.
_Hint_: To get the score on the test set, `evaluate` could have been used: `model.evaluate(X_test_scaled, y_test)`.
---
---
#### Exercise 3: Multi classification - Softmax
##### The question 1 is validated if the code that creates the neural network is:
```
model = keras.Sequential()
model.add(Dense(16, input_shape=(5,), activation= 'sigmoid'))
model.add(Dense(8, activation= 'sigmoid'))
model.add(Dense(5, activation= 'softmax'))
```
---
---
#### Exercise 4: Multi classification - Optimize
##### The question 1 is validated if the chunk of code is:
```
model.compile(loss='categorical_crossentropy',
optimizer='adam',
metrics=['accuracy'])
```
---
---
#### Exercise 5: Multi classification example
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the output of the first ten values of the train labels are:
```
array([[0, 1, 0],
[0, 0, 1],
[0, 1, 0],
[0, 0, 1],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
```
##### The question 2 is validated if the accuracy on the test set is bigger than 90%. To evaluate the accuracy on the test set you can use: `model.evaluate(X_test_sc, y_test_multi_class)`.
Here is an implementation that gives 96% accuracy on the test set.
```
model = Sequential()
model.add(Dense(10, input_dim=4, activation='sigmoid'))
model.add(Dense(3, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train_sc, y_train_multi_class, epochs = 1000, batch_size=20)
```

399
subjects/ai/keras-2/auto-mpg.csv

@ -0,0 +1,399 @@
mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
18,8,307,130,3504,12,70,1,chevrolet chevelle malibu
15,8,350,165,3693,11.5,70,1,buick skylark 320
18,8,318,150,3436,11,70,1,plymouth satellite
16,8,304,150,3433,12,70,1,amc rebel sst
17,8,302,140,3449,10.5,70,1,ford torino
15,8,429,198,4341,10,70,1,ford galaxie 500
14,8,454,220,4354,9,70,1,chevrolet impala
14,8,440,215,4312,8.5,70,1,plymouth fury iii
14,8,455,225,4425,10,70,1,pontiac catalina
15,8,390,190,3850,8.5,70,1,amc ambassador dpl
15,8,383,170,3563,10,70,1,dodge challenger se
14,8,340,160,3609,8,70,1,plymouth 'cuda 340
15,8,400,150,3761,9.5,70,1,chevrolet monte carlo
14,8,455,225,3086,10,70,1,buick estate wagon (sw)
24,4,113,95,2372,15,70,3,toyota corona mark ii
22,6,198,95,2833,15.5,70,1,plymouth duster
18,6,199,97,2774,15.5,70,1,amc hornet
21,6,200,85,2587,16,70,1,ford maverick
27,4,97,88,2130,14.5,70,3,datsun pl510
26,4,97,46,1835,20.5,70,2,volkswagen 1131 deluxe sedan
25,4,110,87,2672,17.5,70,2,peugeot 504
24,4,107,90,2430,14.5,70,2,audi 100 ls
25,4,104,95,2375,17.5,70,2,saab 99e
26,4,121,113,2234,12.5,70,2,bmw 2002
21,6,199,90,2648,15,70,1,amc gremlin
10,8,360,215,4615,14,70,1,ford f250
10,8,307,200,4376,15,70,1,chevy c20
11,8,318,210,4382,13.5,70,1,dodge d200
9,8,304,193,4732,18.5,70,1,hi 1200d
27,4,97,88,2130,14.5,71,3,datsun pl510
28,4,140,90,2264,15.5,71,1,chevrolet vega 2300
25,4,113,95,2228,14,71,3,toyota corona
25,4,98,?,2046,19,71,1,ford pinto
19,6,232,100,2634,13,71,1,amc gremlin
16,6,225,105,3439,15.5,71,1,plymouth satellite custom
17,6,250,100,3329,15.5,71,1,chevrolet chevelle malibu
19,6,250,88,3302,15.5,71,1,ford torino 500
18,6,232,100,3288,15.5,71,1,amc matador
14,8,350,165,4209,12,71,1,chevrolet impala
14,8,400,175,4464,11.5,71,1,pontiac catalina brougham
14,8,351,153,4154,13.5,71,1,ford galaxie 500
14,8,318,150,4096,13,71,1,plymouth fury iii
12,8,383,180,4955,11.5,71,1,dodge monaco (sw)
13,8,400,170,4746,12,71,1,ford country squire (sw)
13,8,400,175,5140,12,71,1,pontiac safari (sw)
18,6,258,110,2962,13.5,71,1,amc hornet sportabout (sw)
22,4,140,72,2408,19,71,1,chevrolet vega (sw)
19,6,250,100,3282,15,71,1,pontiac firebird
18,6,250,88,3139,14.5,71,1,ford mustang
23,4,122,86,2220,14,71,1,mercury capri 2000
28,4,116,90,2123,14,71,2,opel 1900
30,4,79,70,2074,19.5,71,2,peugeot 304
30,4,88,76,2065,14.5,71,2,fiat 124b
31,4,71,65,1773,19,71,3,toyota corolla 1200
35,4,72,69,1613,18,71,3,datsun 1200
27,4,97,60,1834,19,71,2,volkswagen model 111
26,4,91,70,1955,20.5,71,1,plymouth cricket
24,4,113,95,2278,15.5,72,3,toyota corona hardtop
25,4,97.5,80,2126,17,72,1,dodge colt hardtop
23,4,97,54,2254,23.5,72,2,volkswagen type 3
20,4,140,90,2408,19.5,72,1,chevrolet vega
21,4,122,86,2226,16.5,72,1,ford pinto runabout
13,8,350,165,4274,12,72,1,chevrolet impala
14,8,400,175,4385,12,72,1,pontiac catalina
15,8,318,150,4135,13.5,72,1,plymouth fury iii
14,8,351,153,4129,13,72,1,ford galaxie 500
17,8,304,150,3672,11.5,72,1,amc ambassador sst
11,8,429,208,4633,11,72,1,mercury marquis
13,8,350,155,4502,13.5,72,1,buick lesabre custom
12,8,350,160,4456,13.5,72,1,oldsmobile delta 88 royale
13,8,400,190,4422,12.5,72,1,chrysler newport royal
19,3,70,97,2330,13.5,72,3,mazda rx2 coupe
15,8,304,150,3892,12.5,72,1,amc matador (sw)
13,8,307,130,4098,14,72,1,chevrolet chevelle concours (sw)
13,8,302,140,4294,16,72,1,ford gran torino (sw)
14,8,318,150,4077,14,72,1,plymouth satellite custom (sw)
18,4,121,112,2933,14.5,72,2,volvo 145e (sw)
22,4,121,76,2511,18,72,2,volkswagen 411 (sw)
21,4,120,87,2979,19.5,72,2,peugeot 504 (sw)
26,4,96,69,2189,18,72,2,renault 12 (sw)
22,4,122,86,2395,16,72,1,ford pinto (sw)
28,4,97,92,2288,17,72,3,datsun 510 (sw)
23,4,120,97,2506,14.5,72,3,toyouta corona mark ii (sw)
28,4,98,80,2164,15,72,1,dodge colt (sw)
27,4,97,88,2100,16.5,72,3,toyota corolla 1600 (sw)
13,8,350,175,4100,13,73,1,buick century 350
14,8,304,150,3672,11.5,73,1,amc matador
13,8,350,145,3988,13,73,1,chevrolet malibu
14,8,302,137,4042,14.5,73,1,ford gran torino
15,8,318,150,3777,12.5,73,1,dodge coronet custom
12,8,429,198,4952,11.5,73,1,mercury marquis brougham
13,8,400,150,4464,12,73,1,chevrolet caprice classic
13,8,351,158,4363,13,73,1,ford ltd
14,8,318,150,4237,14.5,73,1,plymouth fury gran sedan
13,8,440,215,4735,11,73,1,chrysler new yorker brougham
12,8,455,225,4951,11,73,1,buick electra 225 custom
13,8,360,175,3821,11,73,1,amc ambassador brougham
18,6,225,105,3121,16.5,73,1,plymouth valiant
16,6,250,100,3278,18,73,1,chevrolet nova custom
18,6,232,100,2945,16,73,1,amc hornet
18,6,250,88,3021,16.5,73,1,ford maverick
23,6,198,95,2904,16,73,1,plymouth duster
26,4,97,46,1950,21,73,2,volkswagen super beetle
11,8,400,150,4997,14,73,1,chevrolet impala
12,8,400,167,4906,12.5,73,1,ford country
13,8,360,170,4654,13,73,1,plymouth custom suburb
12,8,350,180,4499,12.5,73,1,oldsmobile vista cruiser
18,6,232,100,2789,15,73,1,amc gremlin
20,4,97,88,2279,19,73,3,toyota carina
21,4,140,72,2401,19.5,73,1,chevrolet vega
22,4,108,94,2379,16.5,73,3,datsun 610
18,3,70,90,2124,13.5,73,3,maxda rx3
19,4,122,85,2310,18.5,73,1,ford pinto
21,6,155,107,2472,14,73,1,mercury capri v6
26,4,98,90,2265,15.5,73,2,fiat 124 sport coupe
15,8,350,145,4082,13,73,1,chevrolet monte carlo s
16,8,400,230,4278,9.5,73,1,pontiac grand prix
29,4,68,49,1867,19.5,73,2,fiat 128
24,4,116,75,2158,15.5,73,2,opel manta
20,4,114,91,2582,14,73,2,audi 100ls
19,4,121,112,2868,15.5,73,2,volvo 144ea
15,8,318,150,3399,11,73,1,dodge dart custom
24,4,121,110,2660,14,73,2,saab 99le
20,6,156,122,2807,13.5,73,3,toyota mark ii
11,8,350,180,3664,11,73,1,oldsmobile omega
20,6,198,95,3102,16.5,74,1,plymouth duster
21,6,200,?,2875,17,74,1,ford maverick
19,6,232,100,2901,16,74,1,amc hornet
15,6,250,100,3336,17,74,1,chevrolet nova
31,4,79,67,1950,19,74,3,datsun b210
26,4,122,80,2451,16.5,74,1,ford pinto
32,4,71,65,1836,21,74,3,toyota corolla 1200
25,4,140,75,2542,17,74,1,chevrolet vega
16,6,250,100,3781,17,74,1,chevrolet chevelle malibu classic
16,6,258,110,3632,18,74,1,amc matador
18,6,225,105,3613,16.5,74,1,plymouth satellite sebring
16,8,302,140,4141,14,74,1,ford gran torino
13,8,350,150,4699,14.5,74,1,buick century luxus (sw)
14,8,318,150,4457,13.5,74,1,dodge coronet custom (sw)
14,8,302,140,4638,16,74,1,ford gran torino (sw)
14,8,304,150,4257,15.5,74,1,amc matador (sw)
29,4,98,83,2219,16.5,74,2,audi fox
26,4,79,67,1963,15.5,74,2,volkswagen dasher
26,4,97,78,2300,14.5,74,2,opel manta
31,4,76,52,1649,16.5,74,3,toyota corona
32,4,83,61,2003,19,74,3,datsun 710
28,4,90,75,2125,14.5,74,1,dodge colt
24,4,90,75,2108,15.5,74,2,fiat 128
26,4,116,75,2246,14,74,2,fiat 124 tc
24,4,120,97,2489,15,74,3,honda civic
26,4,108,93,2391,15.5,74,3,subaru
31,4,79,67,2000,16,74,2,fiat x1.9
19,6,225,95,3264,16,75,1,plymouth valiant custom
18,6,250,105,3459,16,75,1,chevrolet nova
15,6,250,72,3432,21,75,1,mercury monarch
15,6,250,72,3158,19.5,75,1,ford maverick
16,8,400,170,4668,11.5,75,1,pontiac catalina
15,8,350,145,4440,14,75,1,chevrolet bel air
16,8,318,150,4498,14.5,75,1,plymouth grand fury
14,8,351,148,4657,13.5,75,1,ford ltd
17,6,231,110,3907,21,75,1,buick century
16,6,250,105,3897,18.5,75,1,chevroelt chevelle malibu
15,6,258,110,3730,19,75,1,amc matador
18,6,225,95,3785,19,75,1,plymouth fury
21,6,231,110,3039,15,75,1,buick skyhawk
20,8,262,110,3221,13.5,75,1,chevrolet monza 2+2
13,8,302,129,3169,12,75,1,ford mustang ii
29,4,97,75,2171,16,75,3,toyota corolla
23,4,140,83,2639,17,75,1,ford pinto
20,6,232,100,2914,16,75,1,amc gremlin
23,4,140,78,2592,18.5,75,1,pontiac astro
24,4,134,96,2702,13.5,75,3,toyota corona
25,4,90,71,2223,16.5,75,2,volkswagen dasher
24,4,119,97,2545,17,75,3,datsun 710
18,6,171,97,2984,14.5,75,1,ford pinto
29,4,90,70,1937,14,75,2,volkswagen rabbit
19,6,232,90,3211,17,75,1,amc pacer
23,4,115,95,2694,15,75,2,audi 100ls
23,4,120,88,2957,17,75,2,peugeot 504
22,4,121,98,2945,14.5,75,2,volvo 244dl
25,4,121,115,2671,13.5,75,2,saab 99le
33,4,91,53,1795,17.5,75,3,honda civic cvcc
28,4,107,86,2464,15.5,76,2,fiat 131
25,4,116,81,2220,16.9,76,2,opel 1900
25,4,140,92,2572,14.9,76,1,capri ii
26,4,98,79,2255,17.7,76,1,dodge colt
27,4,101,83,2202,15.3,76,2,renault 12tl
17.5,8,305,140,4215,13,76,1,chevrolet chevelle malibu classic
16,8,318,150,4190,13,76,1,dodge coronet brougham
15.5,8,304,120,3962,13.9,76,1,amc matador
14.5,8,351,152,4215,12.8,76,1,ford gran torino
22,6,225,100,3233,15.4,76,1,plymouth valiant
22,6,250,105,3353,14.5,76,1,chevrolet nova
24,6,200,81,3012,17.6,76,1,ford maverick
22.5,6,232,90,3085,17.6,76,1,amc hornet
29,4,85,52,2035,22.2,76,1,chevrolet chevette
24.5,4,98,60,2164,22.1,76,1,chevrolet woody
29,4,90,70,1937,14.2,76,2,vw rabbit
33,4,91,53,1795,17.4,76,3,honda civic
20,6,225,100,3651,17.7,76,1,dodge aspen se
18,6,250,78,3574,21,76,1,ford granada ghia
18.5,6,250,110,3645,16.2,76,1,pontiac ventura sj
17.5,6,258,95,3193,17.8,76,1,amc pacer d/l
29.5,4,97,71,1825,12.2,76,2,volkswagen rabbit
32,4,85,70,1990,17,76,3,datsun b-210
28,4,97,75,2155,16.4,76,3,toyota corolla
26.5,4,140,72,2565,13.6,76,1,ford pinto
20,4,130,102,3150,15.7,76,2,volvo 245
13,8,318,150,3940,13.2,76,1,plymouth volare premier v8
19,4,120,88,3270,21.9,76,2,peugeot 504
19,6,156,108,2930,15.5,76,3,toyota mark ii
16.5,6,168,120,3820,16.7,76,2,mercedes-benz 280s
16.5,8,350,180,4380,12.1,76,1,cadillac seville
13,8,350,145,4055,12,76,1,chevy c10
13,8,302,130,3870,15,76,1,ford f108
13,8,318,150,3755,14,76,1,dodge d100
31.5,4,98,68,2045,18.5,77,3,honda accord cvcc
30,4,111,80,2155,14.8,77,1,buick opel isuzu deluxe
36,4,79,58,1825,18.6,77,2,renault 5 gtl
25.5,4,122,96,2300,15.5,77,1,plymouth arrow gs
33.5,4,85,70,1945,16.8,77,3,datsun f-10 hatchback
17.5,8,305,145,3880,12.5,77,1,chevrolet caprice classic
17,8,260,110,4060,19,77,1,oldsmobile cutlass supreme
15.5,8,318,145,4140,13.7,77,1,dodge monaco brougham
15,8,302,130,4295,14.9,77,1,mercury cougar brougham
17.5,6,250,110,3520,16.4,77,1,chevrolet concours
20.5,6,231,105,3425,16.9,77,1,buick skylark
19,6,225,100,3630,17.7,77,1,plymouth volare custom
18.5,6,250,98,3525,19,77,1,ford granada
16,8,400,180,4220,11.1,77,1,pontiac grand prix lj
15.5,8,350,170,4165,11.4,77,1,chevrolet monte carlo landau
15.5,8,400,190,4325,12.2,77,1,chrysler cordoba
16,8,351,149,4335,14.5,77,1,ford thunderbird
29,4,97,78,1940,14.5,77,2,volkswagen rabbit custom
24.5,4,151,88,2740,16,77,1,pontiac sunbird coupe
26,4,97,75,2265,18.2,77,3,toyota corolla liftback
25.5,4,140,89,2755,15.8,77,1,ford mustang ii 2+2
30.5,4,98,63,2051,17,77,1,chevrolet chevette
33.5,4,98,83,2075,15.9,77,1,dodge colt m/m
30,4,97,67,1985,16.4,77,3,subaru dl
30.5,4,97,78,2190,14.1,77,2,volkswagen dasher
22,6,146,97,2815,14.5,77,3,datsun 810
21.5,4,121,110,2600,12.8,77,2,bmw 320i
21.5,3,80,110,2720,13.5,77,3,mazda rx-4
43.1,4,90,48,1985,21.5,78,2,volkswagen rabbit custom diesel
36.1,4,98,66,1800,14.4,78,1,ford fiesta
32.8,4,78,52,1985,19.4,78,3,mazda glc deluxe
39.4,4,85,70,2070,18.6,78,3,datsun b210 gx
36.1,4,91,60,1800,16.4,78,3,honda civic cvcc
19.9,8,260,110,3365,15.5,78,1,oldsmobile cutlass salon brougham
19.4,8,318,140,3735,13.2,78,1,dodge diplomat
20.2,8,302,139,3570,12.8,78,1,mercury monarch ghia
19.2,6,231,105,3535,19.2,78,1,pontiac phoenix lj
20.5,6,200,95,3155,18.2,78,1,chevrolet malibu
20.2,6,200,85,2965,15.8,78,1,ford fairmont (auto)
25.1,4,140,88,2720,15.4,78,1,ford fairmont (man)
20.5,6,225,100,3430,17.2,78,1,plymouth volare
19.4,6,232,90,3210,17.2,78,1,amc concord
20.6,6,231,105,3380,15.8,78,1,buick century special
20.8,6,200,85,3070,16.7,78,1,mercury zephyr
18.6,6,225,110,3620,18.7,78,1,dodge aspen
18.1,6,258,120,3410,15.1,78,1,amc concord d/l
19.2,8,305,145,3425,13.2,78,1,chevrolet monte carlo landau
17.7,6,231,165,3445,13.4,78,1,buick regal sport coupe (turbo)
18.1,8,302,139,3205,11.2,78,1,ford futura
17.5,8,318,140,4080,13.7,78,1,dodge magnum xe
30,4,98,68,2155,16.5,78,1,chevrolet chevette
27.5,4,134,95,2560,14.2,78,3,toyota corona
27.2,4,119,97,2300,14.7,78,3,datsun 510
30.9,4,105,75,2230,14.5,78,1,dodge omni
21.1,4,134,95,2515,14.8,78,3,toyota celica gt liftback
23.2,4,156,105,2745,16.7,78,1,plymouth sapporo
23.8,4,151,85,2855,17.6,78,1,oldsmobile starfire sx
23.9,4,119,97,2405,14.9,78,3,datsun 200-sx
20.3,5,131,103,2830,15.9,78,2,audi 5000
17,6,163,125,3140,13.6,78,2,volvo 264gl
21.6,4,121,115,2795,15.7,78,2,saab 99gle
16.2,6,163,133,3410,15.8,78,2,peugeot 604sl
31.5,4,89,71,1990,14.9,78,2,volkswagen scirocco
29.5,4,98,68,2135,16.6,78,3,honda accord lx
21.5,6,231,115,3245,15.4,79,1,pontiac lemans v6
19.8,6,200,85,2990,18.2,79,1,mercury zephyr 6
22.3,4,140,88,2890,17.3,79,1,ford fairmont 4
20.2,6,232,90,3265,18.2,79,1,amc concord dl 6
20.6,6,225,110,3360,16.6,79,1,dodge aspen 6
17,8,305,130,3840,15.4,79,1,chevrolet caprice classic
17.6,8,302,129,3725,13.4,79,1,ford ltd landau
16.5,8,351,138,3955,13.2,79,1,mercury grand marquis
18.2,8,318,135,3830,15.2,79,1,dodge st. regis
16.9,8,350,155,4360,14.9,79,1,buick estate wagon (sw)
15.5,8,351,142,4054,14.3,79,1,ford country squire (sw)
19.2,8,267,125,3605,15,79,1,chevrolet malibu classic (sw)
18.5,8,360,150,3940,13,79,1,chrysler lebaron town @ country (sw)
31.9,4,89,71,1925,14,79,2,vw rabbit custom
34.1,4,86,65,1975,15.2,79,3,maxda glc deluxe
35.7,4,98,80,1915,14.4,79,1,dodge colt hatchback custom
27.4,4,121,80,2670,15,79,1,amc spirit dl
25.4,5,183,77,3530,20.1,79,2,mercedes benz 300d
23,8,350,125,3900,17.4,79,1,cadillac eldorado
27.2,4,141,71,3190,24.8,79,2,peugeot 504
23.9,8,260,90,3420,22.2,79,1,oldsmobile cutlass salon brougham
34.2,4,105,70,2200,13.2,79,1,plymouth horizon
34.5,4,105,70,2150,14.9,79,1,plymouth horizon tc3
31.8,4,85,65,2020,19.2,79,3,datsun 210
37.3,4,91,69,2130,14.7,79,2,fiat strada custom
28.4,4,151,90,2670,16,79,1,buick skylark limited
28.8,6,173,115,2595,11.3,79,1,chevrolet citation
26.8,6,173,115,2700,12.9,79,1,oldsmobile omega brougham
33.5,4,151,90,2556,13.2,79,1,pontiac phoenix
41.5,4,98,76,2144,14.7,80,2,vw rabbit
38.1,4,89,60,1968,18.8,80,3,toyota corolla tercel
32.1,4,98,70,2120,15.5,80,1,chevrolet chevette
37.2,4,86,65,2019,16.4,80,3,datsun 310
28,4,151,90,2678,16.5,80,1,chevrolet citation
26.4,4,140,88,2870,18.1,80,1,ford fairmont
24.3,4,151,90,3003,20.1,80,1,amc concord
19.1,6,225,90,3381,18.7,80,1,dodge aspen
34.3,4,97,78,2188,15.8,80,2,audi 4000
29.8,4,134,90,2711,15.5,80,3,toyota corona liftback
31.3,4,120,75,2542,17.5,80,3,mazda 626
37,4,119,92,2434,15,80,3,datsun 510 hatchback
32.2,4,108,75,2265,15.2,80,3,toyota corolla
46.6,4,86,65,2110,17.9,80,3,mazda glc
27.9,4,156,105,2800,14.4,80,1,dodge colt
40.8,4,85,65,2110,19.2,80,3,datsun 210
44.3,4,90,48,2085,21.7,80,2,vw rabbit c (diesel)
43.4,4,90,48,2335,23.7,80,2,vw dasher (diesel)
36.4,5,121,67,2950,19.9,80,2,audi 5000s (diesel)
30,4,146,67,3250,21.8,80,2,mercedes-benz 240d
44.6,4,91,67,1850,13.8,80,3,honda civic 1500 gl
40.9,4,85,?,1835,17.3,80,2,renault lecar deluxe
33.8,4,97,67,2145,18,80,3,subaru dl
29.8,4,89,62,1845,15.3,80,2,vokswagen rabbit
32.7,6,168,132,2910,11.4,80,3,datsun 280-zx
23.7,3,70,100,2420,12.5,80,3,mazda rx-7 gs
35,4,122,88,2500,15.1,80,2,triumph tr7 coupe
23.6,4,140,?,2905,14.3,80,1,ford mustang cobra
32.4,4,107,72,2290,17,80,3,honda accord
27.2,4,135,84,2490,15.7,81,1,plymouth reliant
26.6,4,151,84,2635,16.4,81,1,buick skylark
25.8,4,156,92,2620,14.4,81,1,dodge aries wagon (sw)
23.5,6,173,110,2725,12.6,81,1,chevrolet citation
30,4,135,84,2385,12.9,81,1,plymouth reliant
39.1,4,79,58,1755,16.9,81,3,toyota starlet
39,4,86,64,1875,16.4,81,1,plymouth champ
35.1,4,81,60,1760,16.1,81,3,honda civic 1300
32.3,4,97,67,2065,17.8,81,3,subaru
37,4,85,65,1975,19.4,81,3,datsun 210 mpg
37.7,4,89,62,2050,17.3,81,3,toyota tercel
34.1,4,91,68,1985,16,81,3,mazda glc 4
34.7,4,105,63,2215,14.9,81,1,plymouth horizon 4
34.4,4,98,65,2045,16.2,81,1,ford escort 4w
29.9,4,98,65,2380,20.7,81,1,ford escort 2h
33,4,105,74,2190,14.2,81,2,volkswagen jetta
34.5,4,100,?,2320,15.8,81,2,renault 18i
33.7,4,107,75,2210,14.4,81,3,honda prelude
32.4,4,108,75,2350,16.8,81,3,toyota corolla
32.9,4,119,100,2615,14.8,81,3,datsun 200sx
31.6,4,120,74,2635,18.3,81,3,mazda 626
28.1,4,141,80,3230,20.4,81,2,peugeot 505s turbo diesel
30.7,6,145,76,3160,19.6,81,2,volvo diesel
25.4,6,168,116,2900,12.6,81,3,toyota cressida
24.2,6,146,120,2930,13.8,81,3,datsun 810 maxima
22.4,6,231,110,3415,15.8,81,1,buick century
26.6,8,350,105,3725,19,81,1,oldsmobile cutlass ls
20.2,6,200,88,3060,17.1,81,1,ford granada gl
17.6,6,225,85,3465,16.6,81,1,chrysler lebaron salon
28,4,112,88,2605,19.6,82,1,chevrolet cavalier
27,4,112,88,2640,18.6,82,1,chevrolet cavalier wagon
34,4,112,88,2395,18,82,1,chevrolet cavalier 2-door
31,4,112,85,2575,16.2,82,1,pontiac j2000 se hatchback
29,4,135,84,2525,16,82,1,dodge aries se
27,4,151,90,2735,18,82,1,pontiac phoenix
24,4,140,92,2865,16.4,82,1,ford fairmont futura
23,4,151,?,3035,20.5,82,1,amc concord dl
36,4,105,74,1980,15.3,82,2,volkswagen rabbit l
37,4,91,68,2025,18.2,82,3,mazda glc custom l
31,4,91,68,1970,17.6,82,3,mazda glc custom
38,4,105,63,2125,14.7,82,1,plymouth horizon miser
36,4,98,70,2125,17.3,82,1,mercury lynx l
36,4,120,88,2160,14.5,82,3,nissan stanza xe
36,4,107,75,2205,14.5,82,3,honda accord
34,4,108,70,2245,16.9,82,3,toyota corolla
38,4,91,67,1965,15,82,3,honda civic
32,4,91,67,1965,15.7,82,3,honda civic (auto)
38,4,91,67,1995,16.2,82,3,datsun 310 gx
25,6,181,110,2945,16.4,82,1,buick century limited
38,6,262,85,3015,17,82,1,oldsmobile cutlass ciera (diesel)
26,4,156,92,2585,14.5,82,1,chrysler lebaron medallion
22,6,232,112,2835,14.7,82,1,ford granada l
32,4,144,96,2665,13.9,82,3,toyota celica gt
36,4,135,84,2370,13,82,1,dodge charger 2.2
27,4,151,90,2950,17.3,82,1,chevrolet camaro
27,4,140,86,2790,15.6,82,1,ford mustang gl
44,4,97,52,2130,24.6,82,2,vw pickup
32,4,135,84,2295,11.6,82,1,dodge rampage
28,4,120,79,2625,18.6,82,1,ford ranger
31,4,119,82,2720,19.4,82,1,chevy s-10

136
subjects/ai/keras/README.md

@ -0,0 +1,136 @@
# Keras
The goal of this day is to learn to use Keras to build Neural Networks. As explained on Keras website, Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.
TensorFlow was created by the Google Brain team; it is an open source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (aka neural networking) models and algorithms and makes them useful by way of a common metaphor. It uses Python to provide a convenient front-end API for building applications with the framework, while executing those applications in high-performance C++.
There are two ways to build Keras models: sequential and functional. The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercises focus on the usage of the sequential API.
Note:
The audit will provide the code and output because it is not straightforward to reproduce results using Keras. There are many sources of randomness. Even if all the seeds are fixed to a constant, there may be other sources of randomness. https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Sequential
- Exercise 2: Dense
- Exercise 3: Architecture
- Exercise 4: Optimize
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
- Keras
_Version of Keras I used to do the exercises: 2.4.3_.
I suggest using the most recent one.
### **Resources**
- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required.
1. Create a virtual environment with a version of Python >= `3.8` and the following libraries: `pandas`, `numpy`, `jupyter`, and `keras`.
---
---
# Exercise 1: Sequential
The goal of this exercise is to learn to call the object `Sequential`.
1. Put the object Sequential in a variable named `model` and print the variable `model`.
---
---
# Exercise 2: Dense
The goal of this exercise is to learn to create layers of neurons. Keras proposes options to create custom layers. The neural networks built in these exercises do not require custom layers: `Dense` layers do the job. A dense layer is simply a layer where each unit or neuron is connected to each neuron in the next layer. As seen yesterday, there are three main types of layers: input, hidden and output. The **input layer** that specifies the number of inputs (features) is not represented as a layer in Keras. However, `Dense` has a parameter `input_dim` that gives the number of inputs in the previous layer. The output layer, like any hidden layer, can be created using `Dense`; the only difference is that the output layer contains one single neuron.
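For illustration only (with numbers different from the questions below), a `Dense` layer and its configuration could look like this sketch:
```
from keras.layers import Dense

# a first hidden layer connected to 2 input variables, with 3 neurons
layer = Dense(3, input_dim=2, activation='relu')
print(layer.get_config())
```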
1. Create a `Dense` layer with these parameters and return the output of `get_config`:
- First hidden layer connected to 5 input variables.
- 8 neurons
- sigmoid as activation function
2. Create a `Dense` layer with these parameters and return the output of `get_config`:
- Hidden layer (not the first one)
- 4 neurons
- sigmoid as activation function
3. Create a `Dense` layer with these parameters and return the output of `get_config`:
- Output layer
- 1 neuron
- sigmoid as activation function
---
---
# Exercise 3: Architecture
The goal of this exercise is to combine the layers and to create a neural network.
1. Create a neural network for regression with the following architecture and return `print(model.summary())`:
- 5 input variables
- hidden layer 1: 8 neurons and sigmoid as activation function
- hidden layer 2: 4 neurons and sigmoid as activation function
- output layer: 1 neuron. Find the adapted activation function
---
---
# Exercise 4: Optimize
The goal of this exercise is to learn to train the neural network. Once the architecture of the neural network is set, there are two steps to train it (a minimal sketch of both steps is given after this description):
- `compile`: The compilation step sets the loss function, the algorithm used to minimize the chosen loss function and the metric(s) the model outputs.
- The **optimizer**. We'll stick with a pretty good default: the Adam gradient-based optimizer. Keras has many other optimizers you can look into as well.
- The **loss function**. Depending on the problem to solve, classification or regression, Keras proposes different loss functions. In classification, Keras distinguishes between `binary_crossentropy` (2 classes) and `categorical_crossentropy` (more than 2 classes).
- The **metric(s)**. A list of metrics. Depending on the problem to solve, classification or regression, Keras proposes different metrics. For example, for classification the metric can be the accuracy.
- `fit`: Training a model in Keras literally consists only of calling `fit()` and specifying some parameters. There are a lot of possible parameters, but we'll only supply a few manually:
- The **training data**: the features and the labels, commonly known as X and Y respectively.
- The **number of epochs** (iterations over the entire dataset) to train for.
- The **batch size** (number of samples per gradient update) to use when training.
This article gives more details about **epoch** and **batch size**: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
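Here is a minimal sketch of the two steps on random toy data; the shapes, the number of epochs and the batch size are illustrative assumptions only (question 1 below uses the breast cancer data set instead):
```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# toy data just to make the sketch runnable: 100 samples, 30 features, binary target
X = np.random.rand(100, 30)
y = np.random.randint(0, 2, size=100)

model = Sequential()
model.add(Dense(10, input_dim=30, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

# compile: loss function, optimizer and metric(s)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# fit: iterate over the whole data set for a given number of epochs, by batches
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
```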
1. Create the following neural network (classification):
- Set the right number of input variables
- hidden layer 1: 10 neurons and sigmoid as activation function.
- hidden layer 2: 5 neurons and sigmoid as activation function.
- output layer: 1 neuron and sigmoid as activation function.
- Choose the accuracy metric, the adam optimizer, the adapted loss and a number of epochs smaller than 50.
Import the breast cancer data set from `sklearn.datasets` using `load_breast_cancer` and train the neural network on the data set.
2. Scale the data using `StandardScaler` from `sklearn.preprocessing`. Train the neural network again.

subjects/ai/keras/audit/README.md
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy`, `import pandas`, and `import keras` run without any error?
---
---
#### Exercise 1: Sequential
##### The question 1 is validated if the output ends with `keras.engine.sequential.Sequential object at xxx`
---
---
#### Exercise 2: Dense
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the fields `batch_input_shape`, `units` and `activation` match this output:
```
{'name': 'dense_7',
'trainable': True,
'batch_input_shape': (None, 5),
'dtype': 'float32',
'units': 8,
'activation': 'sigmoid',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None}
```
##### The question 2 is validated if the fields `units` and `activation` match this output:
```
{'name': 'dense_8',
'trainable': True,
'dtype': 'float32',
'units': 4,
'activation': 'sigmoid',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None}
```
##### The question 3 is validated if the fields `units` and `activation` match this output:
```
{'name': 'dense_9',
'trainable': True,
'dtype': 'float32',
'units': 1,
'activation': 'sigmoid',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None}
```
---
---
#### Exercise 3: Architecture
##### The question 1 is validated if the code that creates the neural network is:
```
model = keras.Sequential()
model.add(Dense(8, input_shape=(5,), activation= 'sigmoid'))
model.add(Dense(4, activation= 'sigmoid'))
model.add(Dense(1, activation= 'linear'))
```
The first two layers could use another activation function than sigmoid (e.g. relu)
---
---
#### Exercise 4: Optimize
##### The question 1 is validated if the output of `model.get_config()['layers']` matches the fields `batch_input_shape`, `units` and `activation`.
```
[{'class_name': 'InputLayer',
'config': {'batch_input_shape': (None, 30),
'dtype': 'float32',
'sparse': False,
'ragged': False,
'name': 'dense_134_input'}},
{'class_name': 'Dense',
'config': {'name': 'dense_134',
'trainable': True,
'batch_input_shape': (None, 30),
'dtype': 'float32',
'units': 10,
'activation': 'sigmoid',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None}},
{'class_name': 'Dense',
'config': {'name': 'dense_135',
'trainable': True,
'dtype': 'float32',
'units': 5,
'activation': 'sigmoid',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None}},
{'class_name': 'Dense',
'config': {'name': 'dense_136',
'trainable': True,
'dtype': 'float32',
'units': 1,
'activation': 'sigmoid',
'use_bias': True,
'kernel_initializer': {'class_name': 'GlorotUniform',
'config': {'seed': None}},
'bias_initializer': {'class_name': 'Zeros', 'config': {}},
'kernel_regularizer': None,
'bias_regularizer': None,
'activity_regularizer': None,
'kernel_constraint': None,
'bias_constraint': None}}]
```
You should notice that the neural network is struggling to learn. With some luck, the initialization of the weights might have led to an accuracy close to 90%. But when I trained the neural network with `batch_size=300` on the data, here is the output of the last epoch (50):
`Epoch 50/50 2/2 [==============================] - 0s 1ms/step - loss: 0.6559 - accuracy: 0.6274`
##### The question 2 is validated if the accuracy at epoch 50 is higher than 95%.

subjects/ai/linear-regression-with-scikit-learn/README.md
![Alt Text](w2_day01_linear_regression_video.gif)
# Linear regression with Scikit Learn
The goal of this day is to understand practical Linear regression and supervised learning.
The word "regression" was introduced by Sir Francis Galton (a cousin of C. Darwin) when he
studied the size of individuals within a progeny. He was trying to understand why
large individuals in a population appeared to have smaller children, more
close to the average population size; hence the introduction of the term "regression".
Today we will learn a basic algorithm used in **supervised learning**: **Linear Regression**. We will be using **Scikit-learn**, which is a machine learning library designed to interoperate with the Python libraries NumPy and Pandas.
We will also progressively learn the Machine Learning methodology for supervised learning - today we will focus on evaluating a machine learning model by splitting the data set into a train set and a test set.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Scikit-learn estimator
- Exercise 2: Linear regression in 1D
- Exercise 3: Train test split
- Exercise 4: Forecast diabetes progression
- Exercise 5: Gradient Descent (**Optional**)
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
_Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years.
### **Resources**
### To start with Scikit-learn
- https://scikit-learn.org/stable/tutorial/basic/tutorial.html
- https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html
- https://scikit-learn.org/stable/modules/linear_model.html
### Machine learning methodology and algorithms
- This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Andrew Ng is a star in the Machine Learning community. I recommend spending some time during the projects focusing on some of the algorithms. However, Python is not the language used for the course. https://www.coursera.org/learn/machine-learning
- https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet
- https://scikit-learn.org/stable/tutorial/index.html
### Linear Regression
- https://towardsdatascience.com/laymans-introduction-to-linear-regression-8b334a3dab09
- https://towardsdatascience.com/linear-regression-the-actually-complete-introduction-67152323fcf2
### Train test split
- https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
- https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture?hl=en
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
---
---
# Exercise 1: Scikit-learn estimator
The goal of this exercise is to learn to fit a Scikit-learn estimator and use it to predict.
```console
X, y = [[1],[2.1],[3]], [[1],[2],[3]]
```
1. Fit a LinearRegression from Scikit-learn with X as the features and y as the target, and predict for `x_pred = [[4]]`.
2. Print the coefficients (`coef_`), the intercept (`intercept_`) and the score (`score`) of the regression of X and y. A minimal sketch of this estimator API is given below.
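A minimal sketch of this estimator API on the toy data above:
```python
from sklearn.linear_model import LinearRegression

X, y = [[1], [2.1], [3]], [[1], [2], [3]]

model = LinearRegression()
model.fit(X, y)                       # learn the coefficients on (X, y)
print(model.predict([[4]]))           # prediction for x_pred
print(model.coef_, model.intercept_)  # fitted slope and intercept
print(model.score(X, y))              # R2 score on the data used for fitting
```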
---
---
# Exercise 2: Linear regression in 1D
The goal of this exercise is to understand how linear regression works in one dimension. To do so, we will generate data in one dimension. Using `make_regression` from Scikit-learn, generate a data set with 100 observations:
```python
X, y, coef = make_regression(n_samples=100,
n_features=1,
n_informative=1,
noise=10,
coef=True,
random_state=0,
bias=100.0)
```
1. Plot the data using matplotlib. The plot should look like this:
![alt text][q1]
[q1]: ./w2_day1_ex2_q1.png "Scatter plot"
2. Fit a LinearRegression from Scikit-learn on the generated data and give the equation of the fitted line. The expected output is: `y = coef * x + intercept`
3. Add the fitted line to the plot. The plot should look like this:
![alt text][q3]
[q3]: ./w2_day1_ex2_q3.png "Scatter plot + fitted line"
4. Predict on X.
5. Create a function that computes the Mean Squared Error (MSE) and compute the MSE on the data set. _The MSE is frequently used, as are other regression metrics that will be studied later this week._ A cross-check with scikit-learn's implementation is sketched at the end of this exercise.
```
def compute_mse(y_true, y_pred):
#TODO
return mse
```
Change the `noise` parameter of `make_regression` to 50.
6. Repeat questions 2 and 4, and compute the MSE on the new data.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
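The implementation of `compute_mse` is left to you. As a sanity check, here is a small sketch (with made-up numbers) showing how scikit-learn's `mean_squared_error` can be used to cross-check your own function:
```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 3.5])

mse_manual = np.mean((y_true - y_pred) ** 2)      # mean of the squared differences
mse_sklearn = mean_squared_error(y_true, y_pred)  # scikit-learn's implementation
assert np.isclose(mse_manual, mse_sklearn)
```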
---
---
# Exercise 3: Train test split
The goal of this exercise is to learn to split a data set. It is important to understand why we split the data into two sets. To put it in a nutshell: the Machine Learning model learns on the training data and is evaluated on data it hasn't seen before: the testing data.
This video gives a basic and nice explanation: https://www.youtube.com/watch?v=_vdMKioCXqQ
This article explains the conditions to split the data and how to split it: https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
```python
X = np.arange(1,21).reshape(10,-1)
y = np.arange(1,11)
```
1. Split the data using `train_test_split` with `shuffle=False`. The test set represents 20% of the total size of the data set. Print X_train, y_train, X_test, y_test.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
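For reference, a minimal sketch of this split:
```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1, 21).reshape(10, -1)
y = np.arange(1, 11)

# shuffle=False keeps the original order; test_size=0.2 -> 2 samples out of 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(X_train, y_train, X_test, y_test, sep='\n')
```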
---
---
# Exercise 4: Forecast diabetes progression
The goal of this exercise is to use Linear Regression to forecast the progression of diabetes. Even when it is not explicitly stated, you should **ALWAYS** start by doing an exploratory data analysis in order to have a good understanding of the data you model. As a reminder, here is an introduction to EDA:
- https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9
The data set used is described in https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.
```python
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
```
1. Using `train_test_split`, split the data set into a train set and a test set (20%). Use `random_state=43` for reproducibility of the results.
2. Fit the Linear Regression on all the variables. Give the coefficients and the intercept of the Linear Regression. What is the equation?
3. Predict on the test set. Predicting on the test set is like having new patients for whom, as a physician, you need to forecast the disease progression in one year given the 10 baseline variables.
4. Compute the MSE on the train set and test set. Later this week we will learn about the R2, which will help us evaluate the performance of this fitted Linear Regression. The MSE returns an arbitrary value depending on the range of the error.
**WARNING**: This will be explained later this week. But here, we are doing something "dangerous". As you may have read in the data documentation the data is scaled using the whole dataset whereas we should first scale the data on the training set and then use this scaling on the test set. This is a toy example, so let's ignore this detail for now.
https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
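A possible sketch of questions 1 and 2, assuming plain NumPy arrays (pairing each coefficient with its feature name gives an output in the same spirit as the audit):
```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

model = LinearRegression().fit(X_train, y_train)
# pair each coefficient with its feature name, then append the intercept
print(list(zip(diabetes.feature_names, model.coef_)) + [('intercept', model.intercept_)])
```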
---
---
# Exercise 5: Gradient Descent (Optional)
The goal of this exercise is to understand how the Linear Regression algorithm finds the optimal coefficients.
The goal is to fit a Linear Regression on one-dimensional feature data **without using Scikit-learn**. Let's use the data set we generated for exercise 2:
```python
X, y, coef = make_regression(n_samples=100,
n_features=1,
n_informative=1,
noise=10,
coef=True,
random_state=0,
bias=100.0)
```
_Warning: The shape of X is not the same as the shape of y. You may need (for some questions) to reshape X using: `X.reshape(1,-1)[0]`._
1. Plot the data using matplotlib:
![alt text][ex5q1]
[ex5q1]: ./w2_day1_ex5_q1.png "Scatter plot "
As a reminder, fitting a Linear Regression on this data means finding (a, b) that fits the data points well.
- `y_pred = a*x +b`
Mathematically, it means finding (a,b) that minimizes the MSE, which is the loss used in Linear Regression. If we consider 3 data points:
- `Loss(a,b) = MSE(a,b) = 1/3 * ((y_pred1 - y_true1)**2 + (y_pred2 - y_true2)**2 + (y_pred3 - y_true3)**2)`
and we know:
y_pred1 = a*x1 + b\
y_pred2 = a*x2 + b\
y_pred3 = a\*x3 + b
### Greedy approach
2. Create a function `compute_mse`. Compute mse for `a = 1` and `b = 2`.
**Warning**: `X.shape` is `(100, 1)` and `y.shape` is `(100, )`. Make sure that `y_preds` and `y` have the same shape before computing `y_preds - y`.
```python
def compute_mse(coefs, X, y):
'''
coefs is a list that contains a and b: [a,b]
X is the features set
y is the target
Returns a float which is the MSE
'''
#TODO
y_preds =
mse =
return mse
```
3. Create a grid of **640000** points that combines values of a and b (ranges below). Check that the grid contains 640000 points.
- a between -200 and 200, step= 0.5
- b between -200 and 200, step= 0.5
This is how to compute the grid with the combination of a and b:
```python
aa, bb = np.mgrid[-200:200:0.5, -200:200:0.5]
grid = np.c_[aa.ravel(), bb.ravel()]
```
4. Compute the MSE for all points in the grid. If possible, parallelize the computations. You may need to use `functools.partial` to parallelize a function with several parameters over a list. Put the result in a variable named `losses`. A sketch is given below.
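A sketch of one way to compute `losses`; a possible `compute_mse` is included only so that the snippet runs on its own, and the parallel variant with `functools.partial` is shown in comments:
```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(n_samples=100, n_features=1, n_informative=1,
                             noise=10, coef=True, random_state=0, bias=100.0)
x = X.reshape(1, -1)[0]               # 1D view of the single feature

def compute_mse(coefs, x, y):
    # one possible implementation: coefs = [a, b], predictions a*x + b
    a, b = coefs
    return np.mean((a * x + b - y) ** 2)

aa, bb = np.mgrid[-200:200:0.5, -200:200:0.5]
grid = np.c_[aa.ravel(), bb.ravel()]

# sequential version: one MSE per (a, b) pair of the grid
losses = [compute_mse(coefs, x, y) for coefs in grid]

# parallel variant: functools.partial freezes x and y so that Pool.map
# only iterates over the (a, b) pairs
# from functools import partial
# from multiprocessing import Pool
# with Pool() as pool:
#     losses = pool.map(partial(compute_mse, x=x, y=y), grid)
```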
5. Use this chunk of code to plot the MSE in 2D:
```python
aa, bb = np.mgrid[-200:200:.5, -200:200:.5]
grid = np.c_[aa.ravel(), bb.ravel()]
losses_reshaped = np.array(losses).reshape(aa.shape)
f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(aa,
bb,
losses_reshaped,
100,
cmap="RdBu",
vmin=0,
vmax=160000)
ax_c = f.colorbar(contour)
ax_c.set_label("MSE")
ax.set(aspect="equal",
xlim=(-200, 200),
ylim=(-200, 200),
xlabel="$a$",
ylabel="$b$")
```
The expected output is:
![alt text][ex5q5]
[ex5q5]: ./w2_day1_ex5_q5.png "MSE "
6. From the `losses` list, find the optimal value of a and b and plot the line in the scatter point of question 1.
In this example we computed the MSE 640 000 times. It is frequent to deal with 50 features, which requires 51 parameters to fit the Linear Regression. If we tried this approach with 50 features and a few hundred values per coefficient, we would need to compute on the order of **5.07e+132** MSEs. Even if we reduce the scope and try only 5 values per coefficient, we would still have to compute the MSE **4.4409e+35** times. This approach is not scalable, and that is why it is not used to find the optimal coefficients of a Linear Regression.
### Gradient Descent
In a nutshell, gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters (a and b) of our model. Parameters refer to the coefficients used in Linear Regression. Before starting to implement the questions, take the time to read this article: https://jairiidriss.medium.com/gradient-descent-algorithm-from-scratch-using-python-2b36c1548917. It explains gradient descent and how to implement it. The "tricky" part is the computation of the derivative of the MSE. You can take the formulas of the derivatives for granted when implementing the gradient descent (`d_theta_0` and `d_theta_1` in the article); a minimal sketch is given below.
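A minimal sketch of the update loop, taking the MSE derivatives with respect to a and b as given:
```python
import numpy as np
from sklearn.datasets import make_regression

X, y, coef = make_regression(n_samples=100, n_features=1, n_informative=1,
                             noise=10, coef=True, random_state=0, bias=100.0)
x = X.reshape(1, -1)[0]

a, b = 0.0, 0.0                          # arbitrary starting point
learning_rate, nbr_iterations = 0.1, 100
history = np.zeros((nbr_iterations, 2))  # (a, b) at each iteration, useful for question 8

for i in range(nbr_iterations):
    y_pred = a * x + b
    d_a = (2 / len(x)) * np.sum((y_pred - y) * x)   # derivative of the MSE w.r.t. a
    d_b = (2 / len(x)) * np.sum(y_pred - y)         # derivative of the MSE w.r.t. b
    a -= learning_rate * d_a
    b -= learning_rate * d_b
    history[i] = [a, b]

print(a, b)   # should end up close to the coefficients found by scikit-learn
```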
7. Implement the gradient descent to find optimal a and b with `learning rate = 0.1` and `nbr_iterations=100`.
8. Save the a and b through the iterations in a two dimensional numpy array. Add them to the plot of the previous part and observe a and b that converge towards the minimum. The plot should look like this:
![alt text][ex5q8]
[ex5q8]: ./w2_day1_ex5_q8.png "MSE + Gradient descent"
9. Use Linear Regression from Scikit-learn. Compare the results.

subjects/ai/linear-regression-with-scikit-learn/audit/README.md
#### Linear regression with Scikit Learn
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
###### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?
---
---
#### Exercise 1: Scikit-learn estimator
##### The question 1 is validated if the output is:
```python
array([[3.96013289]])
```
##### The question 2 is validated if the output is:
```output
Coefficients: [[0.99667774]]
Intercept: [-0.02657807]
Score: 0.9966777408637874
```
---
---
#### Exercise 2: Linear regression in 1D
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the plot looks like:
![alt text][q1]
[q1]: ../w2_day1_ex2_q1.png "Scatter plot"
###### The question 2 is validated if the equation of the fitted line is: `y = 42.619430291366946 * x + 99.18581817296929`
###### The question 3 is validated if the plot looks like:
![alt text][q3]
[q3]: ../w2_day1_ex2_q3.png "Scatter plot + fitted line"
###### The question 4 is validated if the outputted prediction for the first 10 values are:
```python
array([ 83.86186727, 140.80961751, 116.3333897 , 64.52998689,
61.34889539, 118.10301628, 57.5347917 , 117.44107847,
108.06237908, 85.90762675])
```
###### The question 5 is validated if the MSE returned is `114.17148616819485`
###### The question 6 is validated if the MSE returned is `2854.2871542048706`
---
---
#### Exercise 3: Train test split
##### The question 1 is validated if X_train, y_train, X_test, y_test match this output:
```console
X_train:
[[ 1 2]
[ 3 4]
[ 5 6]
[ 7 8]
[ 9 10]
[11 12]
[13 14]
[15 16]]
y_train:
[1 2 3 4 5 6 7 8]
X_test:
[[17 18]
[19 20]]
y_test:
[ 9 10]
```
---
---
#### Exercise 4: Forecast diabetes progression
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output of `y_train.values[:10]` and `y_test.values[:10]` are:
```console
y_train.values[:10]:
[[202.]
[ 55.]
[202.]
[ 42.]
[214.]
[173.]
[118.]
[ 90.]
[129.]
[151.]]
y_test.values[:10]:
[[ 71.]
[ 72.]
[235.]
[277.]
[109.]
[ 61.]
[109.]
[ 78.]
[ 66.]
[192.]]
```
##### The question 2 is validated if the coefficients and the intercept are:
```console
[('age', -60.40163046086952),
('sex', -226.08740652083418),
('bmi', 529.383623302316),
('bp', 259.96307686274605),
('s1', -859.121931974365),
('s2', 504.70960058378813),
('s3', 157.42034928335502),
('s4', 226.29533600601638),
('s5', 840.7938070846119),
('s6', 34.712225788519554),
('intercept', 152.05314895029233)]
```
##### The question 3 is validated if the output of `predictions_on_test[:10]` is:
```console
array([[111.74351759],
[ 98.41335251],
[168.36373195],
[255.05882934],
[168.43764643],
[117.60982186],
[198.86966323],
[126.28961941],
[117.73121787],
[224.83346984]])
```
##### The question 4 is validated if the mse on the **train set** is `2888.326888` and the mse on the **test set** is `2858.255153`.
---
---
#### Exercise 5: Gradient Descent (Optional)
##### The exercise is validated if all questions of the exercise are validated.
###### The question 1 is validated if the outputted plot looks like:
![alt text][ex5q1]
[ex5q1]: ../w2_day1_ex5_q1.png "Scatter plot "
###### The question 2 is validated if the output is: `11808.867339751561`
###### The question 3 is validated if `grid.shape` is `(640000,2)`.
###### The question 4 is validated if the 10 first values of losses are:
```console
array([158315.41493175, 158001.96852692, 157689.02212209, 157376.57571726,
157064.62931244, 156753.18290761, 156442.23650278, 156131.79009795,
155821.84369312, 155512.39728829])
```
###### The question 5 is validated if the outputted plot looks like
![alt text][ex5q5]
[ex5q5]: ../w2_day1_ex5_q5.png "MSE"
###### The question 6 is validated if the point returned is:
`array([42.5, 99. ])`. It means that `a= 42.5` and `b=99`.
###### The question 7 is validated if the coefficients returned are:
```console
Coefficients (a): 42.61943031121358
Intercept (b): 99.18581814447936
```
###### The question 8 is validated if the outputted plot is:
![alt text][ex5q8]
[ex5q8]: ../w2_day1_ex5_q8.png "MSE + Gradient descent"
###### The question 9 is validated if the coefficients and intercept returned are:
```console
Coefficients: [42.61943029]
Intercept: 99.18581817296929
```

_Binary image files added to subjects/ai/linear-regression-with-scikit-learn/ (not shown): w2_day01_linear_regression_video.gif, w2_day1_ex2_q1.png, w2_day1_ex2_q3.png, w2_day1_ex5_q1.png, w2_day1_ex5_q5.png, w2_day1_ex5_q6.png, w2_day1_ex5_q8.png_
subjects/ai/machine-learning-pipeline/README.md
# Machine Learning Pipeline
Today we will focus on the data preprocessing and discover the Pipeline object from scikit learn.
1. Manage categorical variables with Integer encoding and One Hot Encoding
2. Impute the missing values
3. Reduce the dimension of the data
4. Scale the data
- **Step 1** is always necessary: models use numbers, so string data, for instance, can't be processed raw.
- **Step 2** is always necessary: missing values do not have a mathematical representation, which is why they have to be imputed.
- **Step 3** is required when the dimension of the data set is high. The dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that contain most of the information (SelectKBest) or by transforming the data. Depending on the signal in the data and the data set size, the dimension reduction is not always required. This step is not covered in detail because of its complexity; understanding the theory behind it is important, however, and I suggest giving it a try during the projects.
- **Step 4** is required when using some types of Machine Learning algorithms. The algorithms that require feature scaling are mostly KNN (K-Nearest Neighbors), Neural Networks, Linear Regression, and Logistic Regression. The reason why some algorithms work better with feature scaling is that the minimization of the loss function may be more difficult if each feature's range is completely different.
These steps are sequential. The output of step 1 is used as input for step 2 and so on; the output of step 4 is used as input for the Machine Learning model.
Scikit-learn proposes an object for this: the Pipeline.
As we know, the model evaluation methodology requires splitting the data set into a train set and a test set. **The preprocessing is learned/fitted on the training set and applied to the test set**.
This object takes as input the preprocessing transforms and a Machine Learning model, and it can then be called the same way a Machine Learning model is called. This is pretty practical because we no longer need to carry many separate objects around. A minimal sketch of the idea is given below.
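A sketch, using the imputation + scaling + model steps that exercise 6 relies on (the step names are arbitrary):
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
# pipe.fit(X_train, y_train) fits the preprocessing and the model on the train set;
# pipe.predict(X_test) applies the same fitted preprocessing before predicting.
```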
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Imputer 1
- Exercise 2: Scaler
- Exercise 3: One hot Encoder
- Exercise 4: Ordinal Encoder
- Exercise 5: Categorical variables
- Exercise 6: Pipeline
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
_Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years.
### **Resources**
### Step 3
- https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e
### Step 4
- https://medium.com/@societyofai/simplest-way-for-feature-scaling-in-gradient-descent-ae0aaa383039#:~:text=Feature%20scaling%20is%20an%20idea,of%20convergence%20of%20gradient%20descent.
### Pipeline
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
---
---
# Exercise 1: Imputer 1
The goal of this exercise is to learn how to use an Imputer to fill missing values on a basic example.
```python
train_data = [[7, 6, 5],
[4, np.nan, 5],
[1, 20, 8]]
```
1. Fit the `SimpleImputer` on the data. Print the `statistics_`. Check that the statistics match `np.nanmean(train_data, axis=0)`.
2. Fill the missing values in `train_data` using the fitted imputer and `transform`.
3. Fill the missing values in `test_data` using the fitted imputer and `transform`.
```python
test_data = [[np.nan, 1, 2],
[7, np.nan, 9],
[np.nan, 2, 4]]
```
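For reference, a minimal sketch of the fit/transform workflow on these two small arrays (the imputer is fitted on the train data only):
```python
import numpy as np
from sklearn.impute import SimpleImputer

train_data = [[7, 6, 5], [4, np.nan, 5], [1, 20, 8]]
test_data = [[np.nan, 1, 2], [7, np.nan, 9], [np.nan, 2, 4]]

imp_mean = SimpleImputer(strategy='mean')   # the default strategy, written out explicitly
imp_mean.fit(train_data)
print(imp_mean.statistics_)                 # column means learned on the train data
print(imp_mean.transform(train_data))       # question 2
print(imp_mean.transform(test_data))        # question 3: filled with the train means
```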
---
---
# Exercise 2: Scaler
The goal of this exercise is to learn to scale a data set. There are various scaling techniques; we will focus on `StandardScaler` from Scikit-learn.
We will use a tiny data set that we will generate ourselves for this exercise:
```python
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
```
1. Fit the `StandardScaler` on the data and scale X_train using `fit_transform`. Compute the `mean` and `std` on `axis 0`.
2. Scale the test set using the `StandardScaler` fitted on the train set.
```python
X_test = np.array([[ 2., -1., 1.],
[ 3., 3., -1.],
[ 1., 1., 1.]])
```
**WARNING:
If the data is split into a train and a test set, it is extremely important to apply the same scaling to the test data. As the model is trained on scaled data, if it takes unscaled data as input, it returns incorrect values.**
Resources:
- https://medium.com/technofunnel/what-when-why-feature-scaling-for-machine-learning-standard-minmax-scaler-49e64c510422
- https://scikit-learn.org/stable/modules/preprocessing.html
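For reference, a minimal sketch of fitting the scaler on the train set only and reusing it on the test set:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
X_test = np.array([[2., -1., 1.], [3., 3., -1.], [1., 1., 1.]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the train set only
print(X_train_scaled.mean(axis=0))               # ~[0., 0., 0.]
print(X_train_scaled.std(axis=0))                # ~[1., 1., 1.]
print(scaler.transform(X_test))                  # reuse the train statistics
```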
---
---
# Exercise 3: One hot Encoder
The goal of this exercise is to learn how to deal with Categorical variables using the OneHot Encoder.
```python
X_train = [['Python'], ['Java'], ['Java'], ['C++']]
```
1. Using `OneHotEncoder` with `handle_unknown='ignore'`, fit the One Hot Encoder and transform X_train. The expected output is:
| | ('C++',) | ('Java',) | ('Python',) |
| --: | -------: | --------: | ----------: |
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 |
To get this output, create a DataFrame from the transformed `X_train` and the attribute `categories_`.
2. Transform X_test using the fitted One Hot Encoder on the train set.
```python
X_test = [['Python'], ['Java'], ['C'], ['C++']]
```
The expected output is:
| | ('C++',) | ('Java',) | ('Python',) |
|---:|-----------:|------------:|--------------:|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 |
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
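A minimal sketch of the encoder usage (one way among others); building a DataFrame with `categories_` as columns is what produces the tuple-like column names shown in the expected output:
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = [['Python'], ['Java'], ['Java'], ['C++']]
X_test = [['Python'], ['Java'], ['C'], ['C++']]

ohe = OneHotEncoder(handle_unknown='ignore')
encoded_train = ohe.fit_transform(X_train).toarray()
print(pd.DataFrame(encoded_train, columns=ohe.categories_))
print(ohe.transform(X_test).toarray())   # 'C' is unknown, so its row is all zeros
```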
---
---
# Exercise 4: Ordinal Encoder
The goal of this exercise is to learn how to deal with Categorical variables using the Ordinal Encoder.
In that case, we want the model to consider that: **good > neutral > bad**
```python
X_train = [['good'], ['bad'], ['neutral']]
```
1. Fit the `OrdinalEncoder` by specifying the categories in the following order: `categories=[['bad', 'neutral', 'good']]`. Transform the train set. Print the `categories_`
2. Transform X_test using the Ordinal Encoder fitted on the train set.
```python
X_test = [['good'], ['good'], ['bad']]
```
_Note: In version 0.22 of Scikit-learn, the Ordinal Encoder doesn't handle new values in the test set. This becomes possible in version 0.24!_
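For reference, a minimal sketch where the order of the `categories` list defines the encoding (bad -> 0, neutral -> 1, good -> 2):
```python
from sklearn.preprocessing import OrdinalEncoder

X_train = [['good'], ['bad'], ['neutral']]
X_test = [['good'], ['good'], ['bad']]

oe = OrdinalEncoder(categories=[['bad', 'neutral', 'good']])
print(oe.fit_transform(X_train))   # question 1
print(oe.categories_)
print(oe.transform(X_test))        # question 2: uses the encoding learned on the train set
```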
---
---
# Exercise 5: Categorical variables
The goal of this exercise is to learn how to deal with Categorical variables with the Ordinal Encoder, the Label Encoder and the One Hot Encoder. For this exercise I strongly suggest using a recent version of `sklearn >= 0.24.1` to avoid issues with the Ordinal Encoder.
Preliminary:
- Load the breast-cancer.csv file
- Drop `Class` column
- Drop NaN values
- Split the data in a train set and test set (test set size = 20% of the total size) with `random_state=43`.
1. Count the number of unique values per feature in the train set.
2. Identify the ordinal variables, the nominal variables and the target. Compute a One Hot Encoder transformation on the test set for all categorical (non-ordinal) features, in the following order `['node-caps' , 'breast', 'breast-quad', 'irradiat']`. Here are the assumptions made on the variables:
```console
age: Ordinal
['90-99' > '80-89' > '70-79' > '60-69' > '50-59' > '40-49' > '30-39' > '20-29' > '10-19']
menopause: Ordinal
['ge40'> 'premeno' >'lt40']
tumor-size: Ordinal
['55-59' > '50-54' > '45-49' > '40-44' > '35-39' > '30-34' > '25-29' > '20-24' > '15-19' > '10-14' > '5-9' > '0-4']
inv-nodes: Ordinal
['36-39' > '33-35' > '30-32' > '27-29' > '24-26' > '21-23' > '18-20' > '15-17' > '12-14' > '9-11' > '6-8' > '3-5' > '0-2']
node-caps: One Hot
['yes' 'no']
deg-malig: Ordinal
[3 > 2 > 1]
breast: One Hot
['left' 'right']
breast-quad: One Hot
['right_low' 'left_low' 'left_up' 'central' 'right_up']
irradiat: One Hot
['recurrence-events' 'no-recurrence-events']
```
- Fit on the train set
- Transform the test set
Example of expected output:
```console
# One Hot encoder on: ['node-caps' , 'breast', 'breast-quad', 'irradiat']
input: ohe.transform(X_test[ohe_cols])[:10]
output:
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1.]])
input: ohe.get_feature_names(ohe_cols)
output:
array(['node-caps_no', 'node-caps_yes', 'breast_left', 'breast_right',
'breast-quad_central', 'breast-quad_left_low',
'breast-quad_left_up', 'breast-quad_right_low',
'breast-quad_right_up', 'irradiat_no', 'irradiat_yes'],
dtype=object)
```
3. Create one Ordinal encoder for all ordinal features, in the following order `["menopause", "age", "tumor-size","inv-nodes", "deg-malig"]`, and apply it to the test set. The documentation of Scikit-learn is not clear on how to perform this on many columns at the same time. Here's a **hint**:
If the ordinal data set is (a subset of two columns, but I keep all rows for this example):
| | menopause | deg-malig |
|---:|:--------------|------------:|
| 0 | premeno | 3 |
| 1 | ge40 | 1 |
| 2 | ge40 | 2 |
| 3 | premeno | 3 |
| 4 | premeno | 2 |
The first step is to create a dictionary or a list - the most recent versions of sklearn take lists as input:
```console
dict_ = {0: ['lt40', 'premeno' , 'ge40'], 1:[1,2,3]}
```
Then to instantiate an `OrdinalEncoder`:
```console
oe = OrdinalEncoder(dict_)
```
Now that you have enough information:
- Fit on the train set
- Transform the test set
4. Use a `make_column_transformer` to combine the two Encoders.
- Fit on the train set
- Transform the test set
_Hint: Check the first resource._
**Note: The version 0.22 of Scikit-learn can't handle `get_feature_names` on `OrdinalEncoder`. If the column transformer contains an `OrdinalEncoder`, the method returns this error**:
```console
AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provide get_feature_names.
```
**It means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the column names in the right order. This step is not required in this exercise.**
Resources:
- https://towardsdatascience.com/guide-to-encoding-categorical-features-using-scikit-learn-for-machine-learning-5048997a5c79
- https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
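A sketch of how the two encoders could be combined with `make_column_transformer`, assuming `X_train`/`X_test` are DataFrames built in the preliminary steps; the column lists and category orders are taken from the assumptions above, and the exact category values (e.g. strings vs integers for `deg-malig`) depend on how the CSV is loaded:
```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

ohe_cols = ['node-caps', 'breast', 'breast-quad', 'irradiat']
ordinal_cols = ['menopause', 'age', 'tumor-size', 'inv-nodes', 'deg-malig']

# categories listed from lowest to highest, in the same order as ordinal_cols
ordinal_categories = [
    ['lt40', 'premeno', 'ge40'],
    ['10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90-99'],
    ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29', '30-34', '35-39',
     '40-44', '45-49', '50-54', '55-59'],
    ['0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '18-20', '21-23',
     '24-26', '27-29', '30-32', '33-35', '36-39'],
    [1, 2, 3],   # deg-malig: may be strings instead, depending on the CSV loading
]

ct = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ohe_cols),
    (OrdinalEncoder(categories=ordinal_categories), ordinal_cols),
)
# ct.fit(X_train) then ct.transform(X_test) concatenates the two encoded blocks,
# one-hot columns first and ordinal columns last (as in the audit output)
```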
---
---
# Exercise 6: Pipeline
The goal of this exercise is to learn to use the Scikit-learn object Pipeline. The data set used for this exercise is the `iris` data set.
Preliminary:
- Run the code below.
```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris['data'], iris['target']
# add missing values
X[[1,20,50,100,135], 0] = np.nan
X[[2,5,88,135], 1] = np.nan
X[[4,15], 2] = np.nan
X[[40,135], 3] = np.nan
```
- Split the data set into a train set and a test set (33%), fit the Pipeline on the train set and predict on the test set. Use `random_state=43`.
The pipeline you will implement has to contain 3 steps:
- Imputer (median)
- Standard Scaler
- LogisticRegression
1. Train the pipeline on the train set and predict on the test set. Give the score of the model on the test set.
---
---

subjects/ai/machine-learning-pipeline/audit/README.md
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?
---
---
#### Exercise 1: Imputer 1
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the `imp_mean.statistics_` returns:
```console
array([ 4., 13., 6.])
```
##### The question 2 is validated if the filled train set is:
```console
array([[ 7., 6., 5.],
[ 4., 13., 5.],
[ 1., 20., 8.]])
```
##### The question 3 is validated if the filled test set is:
```console
array([[ 4., 1., 2.],
[ 7., 13., 9.],
[ 4., 2., 4.]])
```
---
---
#### Exercise 2: Scaler
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the scaled train set is as below. And by definition, the mean on the axis 0 should be `array([0., 0., 0.])` and the standard deviation on the axis 0 should be `array([1., 1., 1.])`.
```console
array([[ 0. , -1.22474487, 1.33630621],
[ 1.22474487, 0. , -0.26726124],
[-1.22474487, 1.22474487, -1.06904497]])
```
##### The question 2 is validated if the scaled test set is:
```console
array([[ 1.22474487, -1.22474487, 0.53452248],
[ 2.44948974, 3.67423461, -1.06904497],
[ 0. , 1.22474487, 0.53452248]])
```
---
---
#### Exercise 3: One hot Encoder
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the output is:
| | ('C++',) | ('Java',) | ('Python',) |
|---:|-----------:|------------:|--------------:|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 |
##### The question 2 is validated if the output is:
| | ('C++',) | ('Java',) | ('Python',) |
|---:|-----------:|------------:|--------------:|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 |
---
---
#### Exercise 4: Ordinal Encoder
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the output of the Ordinal Encoder on the train set is:
```console
array([[2.],
[0.],
[1.]])
```
Check that `enc.categories_` returns`[array(['bad', 'neutral', 'good'], dtype=object)]`.
##### The question 2 is validated if the output of the Ordinal Encoder on the test set is:
```console
array([[2.],
[2.],
[0.]])
```
---
---
#### Exercise 5: Categorical variables
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the number of unique values per feature outputted are:
```console
age 6
menopause 3
tumor-size 11
inv-nodes 6
node-caps 2
deg-malig 3
breast 2
breast-quad 5
irradiat 2
dtype: int64
```
##### The question 2 is validated if the test set transformed by the `OneHotEncoder` fitted on the train set is as below. Make sure the transformer takes as input a DataFrame with the columns in the defined order `['node-caps', 'breast', 'breast-quad', 'irradiat']`:
```console
#First 10 rows:
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1.]])
```
##### The question 3 is validated if the test set transformed by the `OrdinalEncoder` fitted on the train set is as below, with the columns ordered as `["menopause", "age", "tumor-size","inv-nodes", "deg-malig"]`:
```console
#First 10 rows:
array([[1., 2., 5., 0., 1.],
[1., 3., 4., 0., 1.],
[1., 2., 4., 0., 1.],
[1., 3., 2., 0., 1.],
[1., 4., 3., 0., 1.],
[1., 4., 5., 0., 0.],
[2., 5., 4., 0., 1.],
[2., 5., 8., 0., 1.],
[0., 2., 3., 0., 2.],
[1., 3., 6., 4., 2.]])
```
##### The question 4 is validated if the column transformer, fitted on X_train, transforms X_test as:
```console
# First 2 rows:
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 2., 5., 0., 1.],
[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 3., 4., 0., 1.]])
```
---
---
#### Exercise 6: Pipeline
##### The question 1 is validated if the prediction on the test set are:
```console
array([0, 0, 2, 1, 2, 0, 2, 1, 1, 1, 0, 1, 2, 0, 1, 1, 0, 0, 2, 2, 0, 0,
0, 2, 2, 2, 0, 1, 0, 0, 1, 0, 1, 1, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2,
0, 1, 1, 1, 1, 1])
```
and the score on the test set is **98%**.
**Note: Keep in mind that a 98% accuracy is not common when working with real-life data. Every time you have a score > 97%, check that there's no leakage in the data. On financial data sets, the signal-to-noise ratio is low; trying to forecast stock prices is a difficult problem, and an accuracy higher than 70% should be interpreted as a warning to check for data leakage!**

subjects/ai/machine-learning-pipeline/data/breast-cancer.csv
"40-49","premeno","15-19","0-2","yes","3","right","left_up","no","recurrence-events"
"50-59","ge40","15-19","0-2","no","1","right","central","no","no-recurrence-events"
"50-59","ge40","35-39","0-2","no","2","left","left_low","no","recurrence-events"
"40-49","premeno","35-39","0-2","yes","3","right","left_low","yes","no-recurrence-events"
"40-49","premeno","30-34","3-5","yes","2","left","right_up","no","recurrence-events"
"50-59","premeno","25-29","3-5","no","2","right","left_up","yes","no-recurrence-events"
"50-59","ge40","40-44","0-2","no","3","left","left_up","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","0-4","0-2","no","2","right","right_low","no","no-recurrence-events"
"40-49","ge40","40-44","15-17","yes","2","right","left_up","yes","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","right","left_up","no","no-recurrence-events"
"50-59","ge40","30-34","0-2","no","1","right","central","no","no-recurrence-events"
"50-59","ge40","25-29","0-2","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","left","left_low","yes","recurrence-events"
"30-39","premeno","20-24","0-2","no","3","left","central","no","no-recurrence-events"
"50-59","premeno","10-14","3-5","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","right","left_up","no","no-recurrence-events"
"50-59","premeno","40-44","0-2","no","2","left","left_up","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events"
"50-59","lt40","20-24","0-2",nan,"1","left","left_low","no","recurrence-events"
"60-69","ge40","40-44","3-5","no","2","right","left_up","yes","no-recurrence-events"
"50-59","ge40","15-19","0-2","no","2","right","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","1","right","left_up","no","no-recurrence-events"
"30-39","premeno","15-19","6-8","yes","3","left","left_low","yes","recurrence-events"
"50-59","ge40","20-24","3-5","yes","2","right","left_up","no","no-recurrence-events"
"50-59","ge40","10-14","0-2","no","2","right","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","30-34","3-5","yes","3","left","left_low","no","no-recurrence-events"
"40-49","premeno","15-19","15-17","yes","3","left","left_low","no","recurrence-events"
"60-69","ge40","30-34","0-2","no","3","right","central","no","recurrence-events"
"60-69","ge40","25-29","3-5",nan,"1","right","left_low","yes","no-recurrence-events"
"50-59","ge40","25-29","0-2","no","3","left","right_up","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","3","right","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","1","left","left_low","yes","recurrence-events"
"30-39","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","2","right","left_up","no","no-recurrence-events"
"60-69","ge40","45-49","6-8","yes","3","left","central","no","no-recurrence-events"
"40-49","ge40","20-24","0-2","no","3","left","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","1","right","right_low","no","no-recurrence-events"
"30-39","premeno","35-39","0-2","no","3","left","left_low","no","recurrence-events"
"40-49","premeno","35-39","9-11","yes","2","right","right_up","yes","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","20-24","3-5","yes","3","right","right_up","no","recurrence-events"
"30-39","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","premeno","30-34","0-2","no","3","left","right_up","no","recurrence-events"
"60-69","ge40","10-14","0-2","no","2","right","left_up","yes","no-recurrence-events"
"40-49","premeno","35-39","0-2","yes","3","right","left_up","yes","no-recurrence-events"
"50-59","premeno","50-54","0-2","yes","2","right","left_up","yes","no-recurrence-events"
"50-59","ge40","40-44","0-2","no","3","right","left_up","no","no-recurrence-events"
"70-79","ge40","15-19","9-11",nan,"1","left","left_low","yes","recurrence-events"
"50-59","lt40","30-34","0-2","no","3","right","left_up","no","no-recurrence-events"
"40-49","premeno","0-4","0-2","no","3","left","central","no","no-recurrence-events"
"70-79","ge40","40-44","0-2","no","1","right","right_up","no","no-recurrence-events"
"40-49","premeno","25-29","0-2",nan,"2","left","right_low","yes","no-recurrence-events"
"50-59","ge40","25-29","15-17","yes","3","right","left_up","no","no-recurrence-events"
"50-59","premeno","20-24","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","35-39","15-17","no","3","left","left_low","no","no-recurrence-events"
"50-59","ge40","50-54","0-2","no","1","right","right_up","no","no-recurrence-events"
"30-39","premeno","0-4","0-2","no","2","right","central","no","recurrence-events"
"50-59","ge40","40-44","6-8","yes","3","left","left_low","yes","recurrence-events"
"40-49","premeno","30-34","0-2","no","2","right","right_up","yes","no-recurrence-events"
"40-49","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","15-17","yes","3","left","left_low","no","recurrence-events"
"40-49","ge40","20-24","0-2","no","2","right","left_up","no","recurrence-events"
"50-59","ge40","15-19","0-2","no","1","right","central","no","no-recurrence-events"
"30-39","premeno","25-29","0-2","no","2","right","left_low","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","50-54","9-11","yes","2","right","left_up","no","recurrence-events"
"30-39","premeno","10-14","0-2","no","1","right","left_low","no","no-recurrence-events"
"50-59","premeno","25-29","3-5","yes","3","left","left_low","yes","recurrence-events"
"60-69","ge40","25-29","3-5",nan,"1","right","left_up","yes","no-recurrence-events"
"60-69","ge40","10-14","0-2","no","1","right","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","6-8","yes","3","left","right_low","no","recurrence-events"
"30-39","premeno","25-29","6-8","yes","3","left","right_low","yes","recurrence-events"
"50-59","ge40","10-14","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","premeno","15-19","0-2","no","1","left","left_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","central","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","3","left","right_up","no","recurrence-events"
"60-69","ge40","30-34","6-8","yes","2","right","right_up","no","no-recurrence-events"
"50-59","lt40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","left_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","left","left_up","yes","no-recurrence-events"
"30-39","premeno","0-4","0-2","no","2","right","central","no","no-recurrence-events"
"50-59","ge40","35-39","0-2","no","3","left","left_up","no","no-recurrence-events"
"40-49","premeno","40-44","0-2","no","1","right","left_up","no","no-recurrence-events"
"30-39","premeno","25-29","6-8","yes","2","right","left_up","yes","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","1","right","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","0-2","no","1","left","left_up","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","1","right","left_up","no","recurrence-events"
"30-39","premeno","30-34","3-5","no","3","right","left_up","yes","recurrence-events"
"50-59","lt40","20-24","0-2",nan,"1","left","left_up","no","recurrence-events"
"50-59","premeno","10-14","0-2","no","2","right","left_up","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","45-49","0-2","no","2","left","left_low","yes","no-recurrence-events"
"30-39","premeno","40-44","0-2","no","1","left","left_up","no","recurrence-events"
"50-59","premeno","10-14","0-2","no","1","left","left_low","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","3","right","left_up","yes","recurrence-events"
"40-49","premeno","35-39","0-2","no","1","right","left_up","no","recurrence-events"
"40-49","premeno","20-24","3-5","yes","2","left","left_low","yes","recurrence-events"
"50-59","premeno","15-19","0-2","no","2","left","left_low","no","recurrence-events"
"50-59","ge40","30-34","0-2","no","3","right","left_low","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","1","left","right_low","no","no-recurrence-events"
"60-69","ge40","30-34","3-5","yes","2","left","central","yes","recurrence-events"
"60-69","ge40","20-24","3-5","no","2","left","left_low","yes","recurrence-events"
"50-59","premeno","25-29","0-2","no","2","left","right_up","no","recurrence-events"
"50-59","ge40","30-34","0-2","no","1","right","right_up","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","right_low","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","2","left","left_low","yes","no-recurrence-events"
"30-39","premeno","30-34","0-2","no","2","left","left_up","no","no-recurrence-events"
"30-39","premeno","40-44","3-5","no","3","right","right_up","yes","no-recurrence-events"
"60-69","ge40","5-9","0-2","no","1","left","central","no","no-recurrence-events"
"60-69","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","6-8","yes","3","right","left_up","no","recurrence-events"
"60-69","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events"
"40-49","premeno","35-39","9-11","yes","2","right","left_up","yes","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","1","right","left_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","yes","3","right","right_up","no","recurrence-events"
"50-59","premeno","25-29","0-2","yes","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","15-19","0-2","no","2","left","left_low","no","no-recurrence-events"
"30-39","premeno","35-39","9-11","yes","3","left","left_low","no","recurrence-events"
"30-39","premeno","10-14","0-2","no","2","left","right_low","no","no-recurrence-events"
"50-59","ge40","30-34","0-2","no","1","right","left_low","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","2","left","left_up","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","15-19","0-2","no","2","left","left_up","no","recurrence-events"
"60-69","ge40","15-19","0-2","no","2","right","left_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","2","left","right_low","no","no-recurrence-events"
"20-29","premeno","35-39","0-2","no","2","right","right_up","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","3","right","right_up","no","recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","left_low","no","recurrence-events"
"30-39","premeno","30-34","0-2","no","3","left","left_low","no","no-recurrence-events"
"30-39","premeno","15-19","0-2","no","1","right","left_low","no","recurrence-events"
"50-59","ge40","0-4","0-2","no","1","right","central","no","no-recurrence-events"
"50-59","ge40","0-4","0-2","no","1","left","left_low","no","no-recurrence-events"
"60-69","ge40","50-54","0-2","no","3","right","left_up","no","recurrence-events"
"50-59","premeno","30-34","0-2","no","1","left","central","no","no-recurrence-events"
"60-69","ge40","20-24","15-17","yes","3","left","left_low","yes","recurrence-events"
"40-49","premeno","25-29","0-2","no","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","3-5","no","2","right","left_up","no","recurrence-events"
"50-59","premeno","20-24","3-5","yes","2","left","left_low","no","no-recurrence-events"
"50-59","ge40","15-19","0-2","yes","2","left","central","yes","no-recurrence-events"
"50-59","premeno","10-14","0-2","no","3","left","left_low","no","no-recurrence-events"
"30-39","premeno","30-34","9-11","no","2","right","left_up","yes","recurrence-events"
"60-69","ge40","10-14","0-2","no","1","left","left_low","no","no-recurrence-events"
"40-49","premeno","40-44","0-2","no","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","9-11",nan,"3","left","left_up","yes","no-recurrence-events"
"40-49","premeno","50-54","0-2","no","2","right","left_low","yes","recurrence-events"
"50-59","ge40","15-19","0-2","no","2","right","right_up","no","no-recurrence-events"
"50-59","ge40","40-44","3-5","yes","2","left","left_low","no","no-recurrence-events"
"30-39","premeno","25-29","3-5","yes","3","left","left_low","yes","recurrence-events"
"60-69","ge40","10-14","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","lt40","10-14","0-2","no","1","left","right_up","no","no-recurrence-events"
"30-39","premeno","30-34","0-2","no","2","left","left_up","no","recurrence-events"
"30-39","premeno","20-24","3-5","yes","2","left","left_low","no","recurrence-events"
"50-59","ge40","10-14","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","3","right","left_up","no","no-recurrence-events"
"50-59","ge40","25-29","3-5","yes","3","right","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","6-8","no","2","left","left_up","no","no-recurrence-events"
"60-69","ge40","50-54","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","30-34","0-2","no","3","left","left_low","no","no-recurrence-events"
"40-49","ge40","20-24","3-5","no","3","right","left_low","yes","recurrence-events"
"50-59","ge40","30-34","6-8","yes","2","left","right_low","yes","recurrence-events"
"60-69","ge40","25-29","3-5","no","2","right","right_up","no","recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","central","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","left_up","no","no-recurrence-events"
"40-49","premeno","50-54","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","2","right","central","no","recurrence-events"
"50-59","ge40","30-34","3-5","no","3","right","left_up","no","recurrence-events"
"40-49","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","1","right","left_up","no","recurrence-events"
"40-49","premeno","40-44","3-5","yes","3","right","left_up","yes","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","20-24","3-5","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","25-29","9-11","yes","3","right","left_up","no","recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","left_low","no","recurrence-events"
"40-49","premeno","20-24","0-2","no","1","right","right_up","no","no-recurrence-events"
"30-39","premeno","40-44","0-2","no","2","right","right_up","no","no-recurrence-events"
"60-69","ge40","10-14","6-8","yes","3","left","left_up","yes","recurrence-events"
"40-49","premeno","35-39","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","3-5","no","3","left","left_low","no","recurrence-events"
"40-49","premeno","5-9","0-2","no","1","left","left_low","yes","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","1","left","right_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","3","right","right_up","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","3","left","left_up","no","recurrence-events"
"50-59","ge40","5-9","0-2","no","2","right","right_up","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","2","right","right_low","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","2","left","right_up","no","recurrence-events"
"40-49","premeno","10-14","0-2","no","2","left","left_low","yes","no-recurrence-events"
"60-69","ge40","35-39","6-8","yes","3","left","left_low","no","recurrence-events"
"60-69","ge40","50-54","0-2","no","2","right","left_up","yes","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","right","left_up","no","no-recurrence-events"
"30-39","premeno","20-24","3-5","no","2","right","central","no","no-recurrence-events"
"30-39","premeno","30-34","0-2","no","1","right","left_up","no","recurrence-events"
"60-69","lt40","30-34","0-2","no","1","left","left_low","no","no-recurrence-events"
"40-49","premeno","15-19","12-14","no","3","right","right_low","yes","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","3","right","left_low","no","recurrence-events"
"30-39","premeno","5-9","0-2","no","2","left","right_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","3","left","left_up","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","3","left","left_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","1","right","right_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","1","left","right_low","no","no-recurrence-events"
"60-69","ge40","40-44","3-5","yes","3","right","left_low","no","recurrence-events"
"50-59","ge40","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","30-34","0-2","no","3","right","left_up","yes","recurrence-events"
"40-49","ge40","30-34","3-5","no","3","left","left_low","no","recurrence-events"
"40-49","premeno","25-29","0-2","no","1","right","left_low","yes","no-recurrence-events"
"40-49","ge40","25-29","12-14","yes","3","left","right_low","yes","recurrence-events"
"40-49","premeno","40-44","0-2","no","1","left","left_low","no","recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","left_low","no","no-recurrence-events"
"50-59","ge40","25-29","0-2","no","1","left","right_low","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","right","left_up","no","no-recurrence-events"
"70-79","ge40","40-44","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","3","left","left_up","no","recurrence-events"
"50-59","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","ge40","45-49","0-2","no","1","right","right_up","yes","recurrence-events"
"50-59","ge40","20-24","0-2","yes","2","right","left_up","no","no-recurrence-events"
"50-59","ge40","25-29","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events"
"40-49","premeno","20-24","3-5","no","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","35-39","0-2","no","2","left","left_up","no","no-recurrence-events"
"30-39","premeno","20-24","0-2","no","3","left","left_up","yes","recurrence-events"
"60-69","ge40","30-34","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","3","right","left_low","no","no-recurrence-events"
"40-49","ge40","30-34","0-2","no","2","left","left_up","yes","no-recurrence-events"
"30-39","premeno","25-29","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","left","left_low","no","recurrence-events"
"30-39","premeno","20-24","0-2","no","2","left","right_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","2","right","left_low","no","no-recurrence-events"
"50-59","premeno","15-19","0-2","no","2","right","right_low","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","1","right","left_up","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","2","right","left_up","no","no-recurrence-events"
"60-69","ge40","40-44","0-2","no","2","right","left_low","no","recurrence-events"
"30-39","lt40","15-19","0-2","no","3","right","left_up","no","no-recurrence-events"
"40-49","premeno","30-34","12-14","yes","3","left","left_up","yes","recurrence-events"
"60-69","ge40","30-34","0-2","yes","2","right","right_up","yes","recurrence-events"
"50-59","ge40","40-44","6-8","yes","3","left","left_low","yes","recurrence-events"
"50-59","ge40","30-34","0-2","no","3","left",nan,"no","recurrence-events"
"70-79","ge40","10-14","0-2","no","2","left","central","no","no-recurrence-events"
"30-39","premeno","40-44","0-2","no","2","left","left_low","yes","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","2","right","right_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","1","left","left_low","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","10-14","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","10-14","0-2","no","1","left","left_up","no","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","30-34","9-11","yes","3","left","right_low","yes","recurrence-events"
"50-59","ge40","10-14","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","30-34","0-2","no","1","left","right_up","no","no-recurrence-events"
"70-79","ge40","0-4","0-2","no","1","left","right_low","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","3","right","left_up","yes","no-recurrence-events"
"50-59","premeno","25-29","0-2","no","3","right","left_low","yes","recurrence-events"
"50-59","ge40","40-44","0-2","no","2","left","left_low","no","no-recurrence-events"
"60-69","ge40","25-29","0-2","no","3","left","right_low","yes","recurrence-events"
"40-49","premeno","30-34","3-5","yes","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","20-24","0-2","no","2","left","left_up","no","recurrence-events"
"70-79","ge40","20-24","0-2","no","3","left","left_up","no","no-recurrence-events"
"30-39","premeno","25-29","0-2","no","1","left","central","no","no-recurrence-events"
"60-69","ge40","30-34","0-2","no","2","left","left_low","no","no-recurrence-events"
"40-49","premeno","20-24","3-5","yes","2","right","right_up","yes","recurrence-events"
"50-59","ge40","30-34","9-11",nan,"3","left","left_low","yes","no-recurrence-events"
"50-59","ge40","0-4","0-2","no","2","left","central","no","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","3","right","left_low","yes","no-recurrence-events"
"30-39","premeno","35-39","0-2","no","3","left","left_low","no","recurrence-events"
"60-69","ge40","30-34","0-2","no","1","left","left_up","no","no-recurrence-events"
"60-69","ge40","20-24","0-2","no","1","left","left_low","no","no-recurrence-events"
"50-59","ge40","25-29","6-8","no","3","left","left_low","yes","recurrence-events"
"50-59","premeno","35-39","15-17","yes","3","right","right_up","no","recurrence-events"
"30-39","premeno","20-24","3-5","yes","2","right","left_up","yes","no-recurrence-events"
"40-49","premeno","20-24","6-8","no","2","right","left_low","yes","no-recurrence-events"
"50-59","ge40","35-39","0-2","no","3","left","left_low","no","no-recurrence-events"
"50-59","premeno","35-39","0-2","no","2","right","left_up","no","no-recurrence-events"
"40-49","premeno","25-29","0-2","no","2","left","left_up","yes","no-recurrence-events"
"40-49","premeno","35-39","0-2","no","2","right","right_up","no","no-recurrence-events"
"50-59","premeno","30-34","3-5","yes","2","left","left_low","yes","no-recurrence-events"
"40-49","premeno","20-24","0-2","no","2","right","right_up","no","no-recurrence-events"
"60-69","ge40","15-19","0-2","no","3","right","left_up","yes","no-recurrence-events"
"50-59","ge40","30-34","6-8","yes","2","left","left_low","no","no-recurrence-events"
"50-59","premeno","25-29","3-5","yes","2","left","left_low","yes","no-recurrence-events"
"30-39","premeno","30-34","6-8","yes","2","right","right_up","no","no-recurrence-events"
"50-59","premeno","15-19","0-2","no","2","right","left_low","no","no-recurrence-events"
"50-59","ge40","40-44","0-2","no","3","left","right_up","no","no-recurrence-events"

73
subjects/ai/machine-learning-pipeline/data/breast_cancer_readme.txt

@ -0,0 +1,73 @@
Citation Request:
This breast cancer domain was obtained from the University Medical Centre,
Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and
M. Soklic for providing the data. Please include this citation if you plan
to use this database.
1. Title: Breast cancer data (Michalski has used this)
2. Sources:
-- Matjaz Zwitter & Milan Soklic (physicians)
Institute of Oncology
University Medical Center
Ljubljana, Yugoslavia
-- Donors: Ming Tan and Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu)
-- Date: 11 July 1988
3. Past Usage: (Several: here are some)
-- Michalski,R.S., Mozetic,I., Hong,J., & Lavrac,N. (1986). The
Multi-Purpose Incremental Learning System AQ15 and its Testing
Application to Three Medical Domains. In Proceedings of the
Fifth National Conference on Artificial Intelligence, 1041-1045,
Philadelphia, PA: Morgan Kaufmann.
-- accuracy range: 66%-72%
-- Clark,P. & Niblett,T. (1987). Induction in Noisy Domains. In
Progress in Machine Learning (from the Proceedings of the 2nd
European Working Session on Learning), 11-30, Bled,
Yugoslavia: Sigma Press.
-- 8 test results given: 65%-72% accuracy range
-- Tan, M., & Eshelman, L. (1988). Using weighted networks to
represent classification knowledge in noisy domains. Proceedings
of the Fifth International Conference on Machine Learning, 121-134,
Ann Arbor, MI.
-- 4 systems tested: accuracy range was 68%-73.5%
-- Cestnik,G., Konenenko,I, & Bratko,I. (1987). Assistant-86: A
Knowledge-Elicitation Tool for Sophisticated Users. In I.Bratko
& N.Lavrac (Eds.) Progress in Machine Learning, 31-45, Sigma Press.
-- Assistant-86: 78% accuracy
4. Relevant Information:
This is one of three domains provided by the Oncology Institute
that has repeatedly appeared in the machine learning literature.
(See also lymphography and primary-tumor.)
This data set includes 201 instances of one class and 85 instances of
another class. The instances are described by 9 attributes, some of
which are linear and some are nominal.
5. Number of Instances: 286
6. Number of Attributes: 9 + the class attribute
7. Attribute Information:
1. Class: no-recurrence-events, recurrence-events
2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
3. menopause: lt40, ge40, premeno.
4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44,
45-49, 50-54, 55-59.
5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26,
27-29, 30-32, 33-35, 36-39.
6. node-caps: yes, no.
7. deg-malig: 1, 2, 3.
8. breast: left, right.
9. breast-quad: left-up, left-low, right-up, right-low, central.
10. irradiat: yes, no.
8. Missing Attribute Values: (denoted by "?")
Attribute #: Number of instances with missing values:
6. 8
9. 1.
9. Class Distribution:
1. no-recurrence-events: 201 instances
2. recurrence-events: 85 instances

259
subjects/ai/model-selection-methodology/README.md

@ -0,0 +1,259 @@
# Model selection methodology
If you finished yesterday's exercises, you should be able to train several Machine Learning algorithms and to choose the one returned by GridSearchCV.
GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the **cv** parameter to compute the GridSearch with a train set and a test set.
It means that the selected model is based on one single measure. What if, by luck, we predict correctly on that split? What if the best model is actually bad? What if I could have selected a better model?
We will answer these questions today! The topics we will cover are among the most important in Machine Learning.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: K-Fold
- Exercise 2: Cross validation (k-fold)
- Exercise 3: GridsearchCV
- Exercise 4: Validation curve and Learning curve
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
- Scikit-learn
- Matplotlib
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
### **Resources**
**Must read before starting the exercises**
### Bias-Variance trade-off, aka Underfitting/Overfitting:
- https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
- https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html
### Cross-validation
- https://algotrading101.com/learn/train-test-split/
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
---
---
# Exercise 1: K-Fold
The goal of this exercise is to learn to use `KFold` to split the data set in a k-fold cross validation. Most of the time you won't use this function to split your data yourself, because it is used under the hood by others such as `cross_val_score`, `cross_validate` or `GridSearchCV`. But it helps to understand the splitting and to create a custom one if needed.
```python
X = np.array(np.arange(1,21).reshape(10,-1))
y = np.array(np.arange(1,11))
```
1. Using `KFold`, perform a 5-fold cross validation. For each fold, print the train index and test index. The expected output is:
```console
Fold: 1
TRAIN: [2 3 4 5 6 7 8 9] TEST: [0 1]
Fold: 2
TRAIN: [0 1 4 5 6 7 8 9] TEST: [2 3]
Fold: 3
TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5]
Fold: 4
TRAIN: [0 1 2 3 4 5 8 9] TEST: [6 7]
Fold: 5
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
```
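A minimal sketch of one possible solution (using the arrays defined above; yours may differ):
```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(np.arange(1, 21).reshape(10, -1))
y = np.array(np.arange(1, 11))

# 5 folds, no shuffling, so each fold is a contiguous block of indices
kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    print(f"Fold: {fold}")
    print(f"TRAIN: {train_index} TEST: {test_index}")
```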
---
---
# Exercise 2: Cross validation (k-fold)
The goal of this exercise is to learn how to use cross validation. After reading the articles you should be able to explain why we need to cross-validate the models. We will first focus on Linear Regression to reduce the computation time. We will be using `cross_validate` to run the cross validation. Note that `cross_val_score` is similar, but `cross_validate` calculates one or more scores and timings for each CV split.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the cross validation; that is why the code to fit the Linear Regression is given.*
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']
# split data train test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                     y,
                                                     test_size=0.1,
                                                     shuffle=True,
                                                     random_state=43)
# pipeline
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)
```
1. Cross validate the Pipeline using `cross_validate` with 10 folds. Print the scores on each validation set, the mean score on the validation sets and the standard deviation on the validation sets. The expected output is:
```console
Scores on validation sets:
[0.62433594 0.61648956 0.62486602 0.59891024 0.59284295 0.61307055
0.54630341 0.60742976 0.60014575 0.59574508]
Mean of scores on validation sets:
0.60201392526743
Standard deviation of scores on validation sets:
0.0214983822773466
```
**Note: It may be confusing that the key of the dictionary that returns the results on the validation sets is `test_score`. Sometimes, the validation sets are called test sets. Here, we run the cross validation on X_train, which means the scores are computed on subsets of the initial train set. X_test is not used for the cross-validation.**
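As a reference, a minimal sketch of one way to run and summarize this cross validation (reusing `pipe`, `X_train` and `y_train` from the preliminary code; not the only valid solution):
```python
from sklearn.model_selection import cross_validate

# 10-fold cross validation on the training set only; X_test stays untouched
cv_results = cross_validate(pipe, X_train, y_train, cv=10)
scores = cv_results['test_score']  # scores on the validation sets

print("Scores on validation sets:\n", scores)
print("Mean of scores on validation sets:\n", scores.mean())
print("Standard deviation of scores on validation sets:\n", scores.std())
```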
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
- https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/
---
---
# Exercise 3: GridsearchCV
The goal of this exercise is to learn to use GridSearchCV to run a grid search, predict on the test set and score on the test set.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the gridsearch; that is why the code to fit the Linear Regression is given.*
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']
# split data train test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                     y,
                                                     test_size=0.1,
                                                     shuffle=True,
                                                     random_state=43)
# pipeline
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)
```
1. Run `GridSearchCV` on all CPUs with 5 folds, MSE as score, Random Forest as model with:
- max_depth between 1 and 20 (at least 3 values)
- n_estimators between 1 and 100 (at least 3 values)
This may take a few minutes to run.
*Hint*: The name of the metric to put in the parameter `scoring` is `neg_mean_squared_error`. The smaller the MSE, the better the model. On the contrary, the greater the R2, the better the model. `GridSearchCV` chooses the best model by selecting the one that maximizes the score on the validation sets. And, in mathematics, maximizing a function is equivalent to minimizing its opposite. More details:
- https://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error
2. Extract the best fitted estimator, print its params, print its score on the validation set and print `cv_results_`.
3. Compute the score on the test set.
**WARNING: If the score used in classification is the AUC, there is one rare case where the AUC may return an error or a warning: when a fold contains only one class. In that case it can't be computed, by definition.**
---
---
# Exercise 4: Validation curve and Learning curve
The goal of this exercise is to learn to analyse the model's performance with two tools:
- Validation curve
- Learning curve
For this exercise we will use a dataset of 100k data points to give you an idea of the computation time you can expect during projects.
Preliminary:
- Using make_classification from sklearn, generate a binary data set with 100k data points and with 30 features.
```python
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000,
                           n_features=30,
                           n_informative=10,
                           flip_y=0.2)
```
1. Plot the validation curve, using all CPUs, with 5 folds. The goal is to focus again on max_depth between 1 and 20.
You may need to increase the window (example: between 1 and 50) if you notice that other values of max_depth could have returned better results. This may take a few minutes.
I do not expect you to implement the whole plot from scratch; you'd better leverage the code here:
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve
The plot should look like this:
![alt text][logo_ex5q1]
[logo_ex5q1]: ./w2_day5_ex5_q1.png "Validation curve "
The interpretation is that from max_depth=10, the train score keeps increasing but the test score (or validation score) reaches a plateau. It means that choosing max_depth = 20 may lead to an overfitted model.
Note: Given the computation time, it is not possible to plot the validation curve for all parameters. It is useful to plot it for the parameters that control overfitting the most.
More details:
- https://chrisalbon.com/machine_learning/model_evaluation/plot_the_validation_curve/
2. Let us assume the gridsearch returned `clf = RandomForestClassifier(max_depth=12)`. Let's check whether the model underfits, overfits or fits correctly. Plot the learning curve. These two resources will help you a lot to understand how to analyse the learning curves and how to plot them:
- https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py
- **Re-use the function in the second resource**, change the cross validation to a classic 10-fold, run the learning curve data computation on all CPUs and plot the three plots as shown below.
![alt text][logo_ex5q2]
[logo_ex5q2]: ./w2_day5_ex5_q2.png "Learning curve "
- **Note Plot Learning Curves**: The learning curve is detailed in the first resource.
- **Note Plot Scalability of the model**: This plot shows the relationship between the time to train the model and the number of rows in the data. In that case the relationship is linear.
- **Note Performance of the model**: This plot shows whether it is worth increasing the training time by adding data to increase the score. It would be worth adding data if the curve hadn't reached a plateau yet. In that case, increasing the training time by 10 units increases the score by less than 0.001.
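For reference, a minimal sketch of one way to compute the learning curve data before plotting it (default train sizes; the plotting itself is left to the function from the second resource):
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=100000,
                           n_features=30,
                           n_informative=10,
                           flip_y=0.2)

clf = RandomForestClassifier(max_depth=12)
# 10-fold cross validation, all CPUs, timings returned for the scalability plot
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    clf, X, y, cv=10, n_jobs=-1, return_times=True)
```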

131
subjects/ai/model-selection-methodology/audit/README.md

@ -0,0 +1,131 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?
---
---
#### Exercise 1: K-Fold
##### The question 1 is validated if the output of the 5-fold cross validation is:
```console
Fold: 1
TRAIN: [2 3 4 5 6 7 8 9] TEST: [0 1]
Fold: 2
TRAIN: [0 1 4 5 6 7 8 9] TEST: [2 3]
Fold: 3
TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5]
Fold: 4
TRAIN: [0 1 2 3 4 5 8 9] TEST: [6 7]
Fold: 5
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
```
---
---
#### Exercise 2: Cross validation (k-fold)
##### The question 1 is validated if the output is:
```console
Scores on validation sets:
[0.62433594 0.61648956 0.62486602 0.59891024 0.59284295 0.61307055
0.54630341 0.60742976 0.60014575 0.59574508]
Mean of scores on validation sets:
0.60201392526743
Standard deviation of scores on validation sets:
0.0214983822773466
```
The model is consistent across folds: it is stable. That's a first sign that the model is not overfitted. The average R2 is 60%, which is a good start! To be improved...
---
---
#### Exercise 3: GridsearchCV
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the code that runs the grid search is similar to:
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [10, 50, 75],
              'max_depth': [4, 7, 10]}

rf = RandomForestRegressor()
gridsearch = GridSearchCV(rf,
                          parameters,
                          cv=5,
                          n_jobs=-1,
                          scoring='neg_mean_squared_error')
gridsearch.fit(X_train, y_train)
```
Answers that use another list of parameters are accepted too!
##### The question 2 is validated if these attributes were used:
```python
print(gridsearch.best_score_)
print(gridsearch.best_params_)
print(gridsearch.cv_results_)
```
The best score is -0.29028202683007526, which means that the MSE is ~0.29. On its own this doesn't say much since the scale of this metric is arbitrary. This score is the average of `neg_mean_squared_error` on all the validation sets.
The best model's params are `{'max_depth': 10, 'n_estimators': 75}`.
Note that if the parameters used are different, the results should be different.
##### The question 3 is validated if the fitted estimator was used to compute the score on the test set: `gridsearch.score(X_test, y_test)`. The MSE score is ~0.27. The score I got on the test set is close to the score I got on the validation sets. It means the model is not overfitted.
---
---
#### Exercise 4: Validation curve and Learning curve
##### The question 1 is validated if the outputted plot looks like the plot below. The two important points to check are: the training score has to converge towards `1` and the cross-validation score has to reach a plateau around `0.9` from `max_depth = 10`.
![alt text][logo_ex5q1]
[logo_ex5q1]: ../w2_day5_ex5_q1.png "Validation curve "
The code that generated the data in the plot is:
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

clf = RandomForestClassifier()
param_range = np.arange(1, 30, 2)
train_scores, test_scores = validation_curve(clf,
                                             X,
                                             y,
                                             param_name="max_depth",
                                             param_range=param_range,
                                             scoring="roc_auc",
                                             n_jobs=-1)
```
##### The question 2 is validated if the outputted plots looks like:
![alt text][logo_ex5q2]
[logo_ex5q2]: ../w2_day5_ex5_q2.png "Learning curve "

BIN
subjects/ai/model-selection-methodology/w2_day5_ex5_q1.png


BIN
subjects/ai/model-selection-methodology/w2_day5_ex5_q2.png


162
subjects/ai/natural-language-processing-with-spacy/README.md

@ -0,0 +1,162 @@
# Natural Language processing with Spacy
spaCy is a natural language processing (NLP) library for Python designed for fast performance, and with word embedding models built in, it's perfect for a quick and easy start. I don't need to detail what spaCy does; it is perfectly summarized by spaCy in this article: **spaCy 101: Everything you need to know**.
Today, we will learn to use a pre-trained embedding to convert a text into a vector to compute similarity between words or sentences. Remember, embeddings translate large sparse vectors into a lower-dimensional space that preserves semantic relationships.
Word embedding is a technique where individual words of a domain or language are represented as real-valued vectors in a lower dimensional space. The BoW representation's dimension depends on the size of the vocabulary, which can easily reach 10k words. We will also learn to use NER and Part-of-speech. NER allows identifying and segmenting named entities and classifying or categorizing them under various predefined classes. Part-of-speech is a special label assigned to each token (word) in a text corpus to indicate the part of speech and often also other grammatical categories such as tense, number (plural/singular), case etc.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Embedding 1
- Exercise 2: Tokenization
- Exercise 3: Embeddings 2
- Exercise 4: Sentences' similarity
- Exercise 5: NER
- Exercise 6: Part-of-speech tags
### Virtual Environment
- Python 3.x
- Jupyter or JupyterLab
- Pandas
- Spacy
- Scikit-learn
- Matplotlib
I suggest using the most recent versions of these libraries.
### **Resources**
- https://spacy.io/usage/spacy-101
- https://spacy.io/api/doc
- https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/
- https://medium.com/mlearning-ai/nlp-04-part-of-speech-tagging-in-spacy-dc3e239c2726
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment, with a version of Python >= `3.8`, with the following libraries: `pandas`, `jupyter`, `spacy`, `sklearn`, `matplotlib`.
---
---
# Exercise 1: Embedding 1
The goal of this exercise is to learn to load an embedding on SpaCy.
1. Install and load `en_core_web_sm` version `3.4.0` [embedding](https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0). Compute the embedding of `car`.
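A minimal sketch, assuming the model was installed beforehand (for example with `python -m spacy download en_core_web_sm` or from the release page above):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
# token vector computed by the small model's tok2vec component
embedding = nlp("car")[0].vector
print(embedding.shape)
```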
---
---
# Exercise 2: Tokenization
The goal of this exercise is to learn to tokenize a document using Spacy. We did this using NLTK yesterday.
1. Tokenize the text below and print the tokens
```
text = "Tokenize this sentence. And this one too."
```
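One possible way to do it, as a sketch:
```python
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Tokenize this sentence. And this one too."
for token in nlp(text):
    print(token.text)
```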
---
---
# Exercise 3: Embeddings 2
The goal of this exercise is to learn to use SpaCy embedding on a document.
1. Compute the embedding of all the words in this sentence. The language model considered is `en_core_web_md`
```
"laptop computer coffee tea water liquid dog cat kitty"
```
2. Plot the pairwise cosine distances between all the words in a HeatMap.
![alt text][logo]
[logo]: ./w3day05ex1_plot.png "Plot"
https://medium.com/datadriveninvestor/cosine-similarity-cosine-distance-6571387f9bf8
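A minimal sketch of one possible approach (the heatmap styling is up to you; seaborn is another option):
```python
import matplotlib.pyplot as plt
import numpy as np
import spacy
from sklearn.metrics.pairwise import cosine_distances

nlp = spacy.load("en_core_web_md")
words = "laptop computer coffee tea water liquid dog cat kitty".split()
# static word vectors of the medium model, shape (300,) each
vectors = np.array([nlp.vocab[word].vector for word in words])
distances = cosine_distances(vectors)

plt.imshow(distances)
plt.xticks(range(len(words)), words, rotation=45)
plt.yticks(range(len(words)), words)
plt.colorbar()
plt.show()
```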
---
---
# Exercise 4: Sentences' similarity
The goal of this exerice is to learn to compute the similarity between two sentences. As explained in the documentation: **The word embedding of a full sentence is simply the average over all different words**. This is how `similarity` works in SpaCy. This small use case is very interesting because if we build a corpus of sentences that express an intention as **buy shoes**, then we can detect this intention and use it to propose shoes advertisement for customers. The language model used in this exercise is `en_core_web_sm`.
1. Compute the similarities (3 in total) between these sentences:
```
sentence_1 = "I want to buy shoes"
sentence_2 = "I would love to purchase running shoes"
sentence_3 = "I am in my room"
```
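A minimal sketch (note that the small model has no static word vectors, so spaCy may warn that the similarity is based on context-sensitive tensors):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc_1 = nlp("I want to buy shoes")
doc_2 = nlp("I would love to purchase running shoes")
doc_3 = nlp("I am in my room")

print("sentence_1 <=> sentence 2 :", doc_1.similarity(doc_2))
print("sentence_1 <=> sentence 3:", doc_1.similarity(doc_3))
print("sentence_2 <=> sentence 3:", doc_2.similarity(doc_3))
```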
---
---
# Exercise 5: NER
The goal of this exercise is to learn to use a Named entity recognition algorithm to detect entities.
```
Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. It is considered one of the Big Five companies in the U.S. information technology industry, along with Amazon, Google, Microsoft, and Facebook.
Apple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976 to develop and sell Wozniak's Apple I personal computer, though Wayne sold his share back within 12 days. It was incorporated as Apple Computer, Inc., in January 1977, and sales of its computers, including the Apple I and Apple II, grew quickly.
```
1. Extract all named entities in the text as well as the label of the named entity.
2. The NER is also useful to remove ambiguous entities. From a conceptual standpoint, disambiguation is the process of determining the most probable meaning of a specific phrase. For example, in the sentence below, the word `apple` is present twice: the first time to mention the fruit and the second to mention a company. Run the NER on this sentence and print the named entity, the `start_char`, the `end_char` and the label of the named entity.
```
Paul eats an apple while watching a movie on his Apple device.
```
https://en.wikipedia.org/wiki/Named-entity_recognition
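A minimal sketch for question 2 (question 1 is the same loop, without `start_char`/`end_char`, run on the longer text):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Paul eats an apple while watching a movie on his Apple device.")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
```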
---
---
# Exercise 6: Part-of-speech tags
The goal of this exercise is to learn to use Part-of-speech tags (**POS TAG**) using Spacy. As explained in Wikipedia, POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
Example
The sentence: **"Heat water in a large vessel"** is tagged this way after the POS TAG:
- heat verb (noun)
- water noun (verb)
- in prep (noun, adv)
- a det (noun)
- large adj (noun)
- vessel noun
The data `news_amazon.txt` used is a newspaper article about Amazon.
1. Return all sentences mentioning **Bezos** as a NNP (tag).
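A minimal sketch of one possible approach (the file path is an assumption; adjust it to where `news_amazon.txt` is stored):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
# path assumed relative to the subject folder; adjust if needed
with open("resources/news_amazon.txt") as f:
    doc = nlp(f.read())

for sentence in doc.sents:
    for token in sentence:
        if token.text == "Bezos" and token.tag_ == "NNP":
            print(f"INFO: {token.text} {token.pos_} {token.tag_}")
            print(f"Sentence: {sentence.text.strip()}")
            break
```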

148
subjects/ai/natural-language-processing-with-spacy/audit/README.md

@ -0,0 +1,148 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Do `import jupyter`, `import pandas` and `import spacy` run without any error?
---
---
#### Exercise 1: Embedding 1
##### The question 1 is validated if the embedding's shape is `(96,)`
##### The question 1 is validated if the first 20 values of the vector sum to `2.9790137708187103`
---
---
#### Exercise 2: Tokenization
##### The question 1 is validated if the tokens printed are:
```
Tokenize
this
sentence
.
And
this
one
too
.
```
---
---
#### Exercise 3: Embeddings 2
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the embedding of each word has a shape of `(300,)` and if the first 20 values of the embedding of laptop are:
```
array([-0.37639 , -0.075521, 0.4908 , 0.19863 , -0.11088 , -0.076145,
-0.30367 , -0.69663 , 0.87048 , 0.54388 , 0.42523 , 0.18045 ,
-0.4358 , -0.32606 , -0.70702 , -0.069127, -0.42674 , 2.4147 ,
0.26806 , 0.46584 ], dtype=float32)
```
##### The question 2 is validated if the output is
![alt text][logo]
[logo]: ../w3day05ex1_plot.png "Plot"
---
---
#### Exercise 4: Sentences' similarity
##### The question 1 is validated if the similarities between the sentences are:
```
sentence_1 <=> sentence 2 : 0.7073220863266589
sentence_1 <=> sentence 3: 0.42663743263528325
sentence_2 <=> sentence 3: 0.3336274235605957
```
---
---
#### Exercise 5: NER
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output of the NER is:
```
Apple Inc. ORG
American NORP
Cupertino GPE
California GPE
Five CARDINAL
U.S. GPE
Amazon ORG
Google ORG
Microsoft ORG
Facebook ORG
Apple ORG
Steve Jobs PERSON
Steve Wozniak PERSON
Ronald Wayne PERSON
April 1976 DATE
Wozniak PERSON
Apple ORG
Wayne PERSON
12 days DATE
Apple Computer, Inc. ORG
January 1977 DATE
Apple ORG
Apple II ORG
```
##### The question 2 is validated if the output shows that the first occurrence of apple is not a named entity. In my case here is what the NER returns:
```
Paul 1 5 PERSON
Apple 50 55 ORG
```
---
---
#### Exercise 6: Part-of-speech tags
##### The question 1 is validated if the output sentences are:
```
INFO: Bezos PROPN NNP
Sentence: Amazon (AMZN) enters 2021 with plenty of big opportunities, but is losing its lauded Chief Executive Jeff Bezos, who announced his plan to step aside in the third quarter.
INFO: Bezos PROPN NNP
Sentence: Bezos will hand off his role as chief executive to Andy Jassy, the CEO of its cloud computing unit.
INFO: Bezos PROPN NNP
Sentence: He's not leaving, as Bezos will transition to the role of Executive Chairman and remain active.
INFO: Bezos PROPN NNP
Sentence: "When you look at our financial results, what you're actually seeing are the long-run cumulative results of invention," Bezos said in written remarks with the Amazon earnings release.
```

18
subjects/ai/natural-language-processing-with-spacy/resources/news_amazon.txt

@ -0,0 +1,18 @@
Amazon (AMZN) enters 2021 with plenty of big opportunities, but is losing its lauded Chief Executive Jeff Bezos, who announced his plan to step aside in the third quarter. Given that, is Amazon stock a buy?
Bezos will hand off his role as chief executive to Andy Jassy, the CEO of its cloud computing unit. He's not leaving, as Bezos will transition to the role of Executive Chairman and remain active. But one Wall Street analyst thinks the changing of the guard is further evidence of a pivotal shift in Amazon's long-term strategy that will alter the course of its business operations.
D.A. Davidson analyst Tom Forte believes that Amazon is steadily evolving into being a much larger services company. The idea is, rather than disrupt and destroy businesses as it has frequently done, Amazon will instead try to help them. They'll cooperate with Amazon but still be in competition with it. It's a business strategy referred to as "coopetition."
"It's the newer, kinder, gentler Amazon," Forte told Investor's Business Daily. "They'll help you make more money rather than take away all your sales."
The massive warehousing and shipping abilities Amazon continues to build would be made more widely available to other businesses. Not only would Amazon help them advertise and sell products, it might also provide logistics services, storage and shipping. Also, if a company needs video-streaming help, Amazon Video has them covered. And its cloud services operation, the largest worldwide, is already used by thousands of businesses large and small.
Amazon Stock Gets Boost From Sizzling Earnings Report
The management shift was announced the same day Amazon turned in a sizzling fourth-quarter earnings report that smashed expectations. Earnings were double what analysts expected. And company revenue in the quarter broke above $100 billion for the first time.
Amazon carries plenty of big opportunities into 2021, providing plenty of fuel for continued growth in Amazon stock.
"When you look at our financial results, what you're actually seeing are the long-run cumulative results of invention," Bezos said in written remarks with the Amazon earnings release. "Right now I see Amazon at its most inventive ever, making it an optimal time for this transition."
There are other compelling reasons why Amazon would want to go down this path. The service business is highly profitable. Moreover, the new Amazon could ease the pressure it's receiving in Congress over regulatory and antitrust concerns. Congress increasingly views Amazon as an overwhelming threat to fair competition.

BIN
subjects/ai/natural-language-processing-with-spacy/w3day05ex1_plot.png

(binary image added, 33 KiB)

218
subjects/ai/natural-language-processing/README.md

@ -0,0 +1,218 @@
# Natural Language processing
“NLP makes it possible for humans to talk to machines.” This branch of AI enables computers to understand, interpret, and manipulate human language. This technology is one of the most broadly applied areas of machine learning and is critical for effectively analyzing massive quantities of unstructured, text-heavy data.
Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in an unordered bucket. This approach is called a bag of words model, or BoW for short. It's referred to as a “bag” of words because any information about the structure of the sentence is lost. This is useful to train classical machine learning models on text data. Other types of models, such as RNNs or LSTMs, take as input a complete and ordered sequence.
Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. The article **Your Guide to Natural Language Processing (NLP)** gives a very good introduction to NLP.
Today, we will learn to preprocess text data and to create a bag of words representation, using the NLTK and Spacy packages for the preprocessing.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Lower case
- Exercise 2: Punctuation
- Exercise 3: Tokenization
- Exercise 4: Stop words
- Exercise 5: Stemming
- Exercise 6: Text preprocessing
- Exercise 7: Bag of Word representation
### Virtual Environment
- Python 3.x
- Jupyter or JupyterLab
- Pandas
- Scikit Learn
- NLTK
I suggest using the most recent versions of the libraries.
### **Resources**
- https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
- https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment with a version of Python >= `3.8`, with the following libraries: `pandas`, `jupyter`, `nltk` and `scikit-learn`.
---
---
# Exercise 1: Lowercase
The goal of this exercise is to learn to lowercase text data in Python. Note that if the volume of data is low, the text data can be stored in a Pandas DataFrame or Series. But when dealing with high volumes (high but not huge), using a Pandas DataFrame or Series is not efficient. Data structures such as dictionaries or lists are better suited.
```
list_ = ["This is my first NLP exercise", "wtf!!!!!"]
series_data = pd.Series(list_, name='text')
```
1. Print all texts in lowercase
2. Print all texts in upper case
Note: Do not change the text manually!
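A minimal sketch using the pandas string accessor, which is one possible way to do it:
```python
import pandas as pd

list_ = ["This is my first NLP exercise", "wtf!!!!!"]
series_data = pd.Series(list_, name='text')

# Vectorized string methods: the text itself is never edited manually.
print(series_data.str.lower())
print(series_data.str.upper())
```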
---
---
# Exercise 2: Punctuation
The goal of this exercise is to learn to deal with punctuation. In Natural Language Processing, some basic approaches such as Bag of Words model the text as an unordered combination of words. In that case the punctuation is not always useful, as it doesn't add information to the model. That is why it is removed.
1. Remove the punctuation from this sentence. All characters in !"#$%&'()\*+,-./:;<=>?@[\]^\_`{|}~ are considered as punctuation.
```
Remove, this from .? the sentence !!!! !"#&'()*+,-./:;<=>_
```
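One possible sketch, relying on `string.punctuation` from the standard library (which contains exactly the characters listed above):
```python
import string

sentence = """Remove, this from .? the sentence !!!! !"#&'()*+,-./:;<=>_"""

# Build a translation table that maps every punctuation character to None.
table = str.maketrans('', '', string.punctuation)
print(sentence.translate(table))
```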
---
---
# Exercise 3: Tokenization
The goal of this exercise is to learn to tokenize a text. This step is important because it splits the text into tokens. A token could be a sentence or a word.
```
text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software."""
```
1. Tokenize this text using `sent_tokenize` from NLTK.
2. Tokenize this text using `word_tokenize` from NLTK.
_Resource_:
https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
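A minimal sketch with NLTK; downloading the `punkt` tokenizer data is assumed to be needed only once.
```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # tokenizer models, downloaded once

text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software."""

print(sent_tokenize(text))  # list of sentences
print(word_tokenize(text))  # list of word and punctuation tokens
```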
---
---
# Exercise 4: Stop words
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refer to the most common words in a language. For example: "and", "is" and "a" are stop words and do not add information to a sentence.
```
text = """
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
"""
```
1. Remove the stop words from this sentence and return the list of word tokens without stop words.
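A minimal sketch, assuming NLTK's English stop word list; note that the comparison here is case-sensitive, which is why words such as `The` may remain.
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

text = """
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
"""

stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)

# Keep only the tokens that are not in the English stop word list.
print([t for t in tokens if t not in stop_words])
```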
---
---
# Exercise 5: Stemming
The goal of this exercise is to learn to use stemming with NLTK. As explained in detail in the article, stemming is the process of reducing inflection in words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
Note: The output of a stemmer is a word that may not exist in the dictionary.
```
text = """
The interviewer interviews the president in an interview
"""
```
1. Return the list of the stemmed tokens.
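A minimal sketch using NLTK's `PorterStemmer`; any other stemmer would work, but the stems would differ.
```python
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')

text = """
The interviewer interviews the president in an interview
"""

stemmer = PorterStemmer()

# Stem each token; the result may not be a dictionary word (e.g. 'presid').
print([stemmer.stem(t) for t in word_tokenize(text)])
```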
---
---
# Exercise 6: Text preprocessing
The goal of this exercise is to learn to create a function to preprocess and clean a text using NLTK.
Put this text in a variable:
```
01 Edu System presents an innovative curriculum in software engineering and programming. With a renowned industry-leading reputation, the curriculum has been rigorously designed for learning skills of the digital world and technology industry. Taking a different approach than the classic teaching methods today, learning is facilitated through a collective and co-créative process in a professional environment.
```
1. Write a function that takes as input the text and returns it preprocessed.
The preprocessing is composed of:
1. Lowercase
2. Removing Punctuation
3. Tokenization
4. Stopword Filtering
5. Stemming
_Resources: https://towardsdatascience.com/nlp-preprocessing-with-nltk-3c04ee00edc0_
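A minimal sketch of such a function, chaining the previous exercises with NLTK; the exact stemmer and stop word list are assumptions.
```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

def preprocess(text):
    """Lowercase, remove punctuation, tokenize, filter stop words, stem."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("01 Edu System presents an innovative curriculum."))
```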
---
---
# Exercise 7: Bag of Word representation
The goal of this exercise is to understand how to create a Bag of Words (BoW) model on a corpus of texts. More precisely, we will create a labeled data set from textual data using a word count matrix.
_Resources: https://machinelearningmastery.com/gentle-introduction-bag-words-model/_
As explained in the resource, the Bag of Words representation makes the assumption that the order in which the words appear in a text doesn't matter. There are different types of Bag of Words representations:
- Boolean: Each document is a boolean vector
- Wordcount: Each document is a word count vector
- TFIDF: Each document is a score vector. The score is detailed in the next exercise.
The data `tweets_train.txt` contains tweets labeled with a sentiment. It gives the positivity of a tweet.
Steps:
1. Preprocess the data using the function implemented in the previous exercise. Then, using `CountVectorizer` from scikit-learn with `max_features=500`, compute the word count of the tweets. The output is a sparse matrix (a minimal sketch is given after the resources at the end of this exercise).
- Check the shape of the word count matrix
- Set **max_features** to 500 instead of the initial size of the dictionary.
Reminder: a data set is often described as an m x n matrix in which m is the number of rows and n is the number of columns (features). It is strongly recommended to work with m >> n. The value of the ratio depends on the signal existing in the data set and on the model complexity.
2. Using `pd.DataFrame.sparse.from_spmatrix` from pandas, create a DataFrame with documents in rows and the dictionary in columns.
| | and | boat | compute |
| --: | --: | ---: | ------: |
| 0 | 0 | 2 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 |
3. Create a dataframe with the labels
- 1: positive
- 0: neutral
- -1: negative
| | target |
| --: | -----: |
| 0 | -1 |
| 1 | 0 |
| 2 | 1 |
_Resources: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html_
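A minimal sketch of the word count step on a placeholder corpus; in the exercise the tweets come from `tweets_train.txt` and are cleaned with the `preprocess` function of the previous exercise. `get_feature_names_out` assumes scikit-learn >= 1.0 (older versions use `get_feature_names`).
```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus standing in for the preprocessed tweets.
tweets = ["I love this movie", "so boring and long", "not bad at all"]

vectorizer = CountVectorizer(max_features=500)
counts = vectorizer.fit_transform(tweets)   # sparse word count matrix
print(counts.shape)

# Sparse matrix -> DataFrame: documents in rows, dictionary in columns.
df = pd.DataFrame.sparse.from_spmatrix(
    counts, columns=vectorizer.get_feature_names_out()
)
print(df)
```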

208
subjects/ai/natural-language-processing/audit/README.md

@ -0,0 +1,208 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import pandas`, `import nltk` and `import sklearn` run without any error?
---
---
#### Exercise 1: Lower case
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is:
```
0 this is my first nlp exercise
1 wtf!!!!!
Name: text, dtype: object
```
##### The question 2 is validated if the output is:
```
0 THIS IS MY FIRST NLP EXERCISE
1 WTF!!!!!
Name: text, dtype: object
```
---
---
#### Exercise 2: Punctuation
##### The question 1 is validated if the output doesn't contain punctuation `` !"#$%&'()*+,-./:;<=>?@[]^_`{|}~ ``. Do not take into account the spaces in the output. The output should be:
```
Remove this from the sentence
```
---
---
#### Exercise 3: Tokenization
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is:
```
['Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto.',
'The currency began use in 2009 when its implementation was released as open-source software.']
```
##### The question 2 is validated if the output is:
```
['Bitcoin',
'is',
'a',
'cryptocurrency',
'invented',
'in',
'2008',
'by',
'an',
'unknown',
'person',
'or',
'group',
'of',
'people',
'using',
'the',
'name',
'Satoshi',
'Nakamoto',
'.',
'The',
'currency',
'began',
'use',
'in',
'2009',
'when',
'its',
'implementation',
'was',
'released',
'as',
'open-source',
'software',
'.']
```
---
---
#### Exercise 4: Stop words
##### The question 1 is validated if, using NLTK, the output is:
```
['The', 'goal', 'exercise', 'learn', 'remove', 'stop', 'words', 'NLTK', '.', 'Stop', 'words', 'usually', 'refers', 'common', 'words', 'language', '.']
```
---
---
#### Exercise 5: Stemming
##### The question 1 is validated if, using NLTK, the output is:
```
['the', 'interview', 'interview', 'the', 'presid', 'in', 'an', 'interview']
```
---
---
#### Exercise 6: Text preprocessing
##### The question 1 is validated if the output is:
```
['01',
'edu',
'system',
'present',
'innov',
'curriculum',
'softwar',
'engin',
'program',
'renown',
'industrylead',
'reput',
'curriculum',
'rigor',
'design',
'learn',
'skill',
'digit',
'world',
'technolog',
'industri',
'take',
'differ',
'approach',
'classic',
'teach',
'method',
'today',
'learn',
'facilit',
'collect',
'cocré',
'process',
'profession',
'environ']
```
---
---
#### Exercise 7: Bag of Word representation
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output of the CountVectorizer is
```
<6588x500 sparse matrix of type '<class 'numpy.int64'>'
with 79709 stored elements in Compressed Sparse Row format>
```
##### The question 2 is validated if the output of `print(df.iloc[:3,400:403].to_markdown())` is:
| | talk | team | tell |
|---:|-------:|-------:|-------:|
| 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 |
##### The question 3 is validated if the shape of the wordcount DataFrame is `(6588, 501)` and if the output of `print(df.iloc[300:304,499:501].to_markdown())` is:
| | youtube | label |
|----:|----------:|--------:|
| 300 | 0 | 0 |
| 301 | 0 | -1 |
| 302 | 1 | 0 |
| 303 | 0 | 1 |

274
subjects/ai/neural-networks/README.md

@ -0,0 +1,274 @@
# Neural Networks
Last week you learnt about some Machine Learning algorithms such as Random Forest or Gradient Boosting. Neural Networks are another type of Machine Learning algorithm that is intensively used because of its efficiency. Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated. Different types of neural networks exist and are specific to some use cases, for example CNNs for images, RNNs or LSTMs for time series or text, etc.
Today we will focus on Artificial Neural Networks. The goal is to understand how neural networks work, to train them on data and to understand the challenges of training a neural network. The resources below explain the mechanisms behind neural networks very well, step by step.
However, the exercises won't cover architectures such as RNNs and LSTMs (used on sequences such as time series or text) or CNNs (used a lot in image processing). One of the projects will require knowing how to use these special architectures. To do so, I suggest that you go through this course: https://fr.coursera.org/specializations/deep-learning.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: The neuron
- Exercise 2: Neural network
- Exercise 3: Log loss
- Exercise 4: Forward propagation
- Exercise 5: Regression
### Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
_Version of NumPy I used to do the exercises: 1.18.1_.
I suggest using the most recent one.
### **Resources**
- https://victorzhou.com/blog/intro-to-neural-networks/
- https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-functions-1d98286cf1e4
- https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment with a version of Python >= `3.8`, with the following libraries: `numpy` and `jupyter`.
---
---
# Exercise 1: The neuron
The goal of this exercise is to understand the role of a neuron and to implement a neuron.
An artificial neuron, the basic unit of the neural network (also referred to as a perceptron), is a mathematical function. It takes one or more inputs that are multiplied by values called “weights” and added together. This value is then passed to a non-linear function, known as an activation function, to become the neuron's output.
As described in the article, **a neuron takes inputs, does some math with them, and produces one output**.
Let us assume there are 2 inputs. Here are the three steps involved in the neuron:
1. Each input is multiplied by a weight
- x1 -> x1 \* w1
- x2 -> x2 \* w2
2. The weighted inputs are added together with a bias b
- (x1 \* w1) + (x2 \* w2) + b
3. The sum is passed through an activation function
- y = f((x1 \* w1) + (x2 \* w2) + b)
- The activation function is a function you know from W2DAY2 (Logistic Regression): **the sigmoid**
Example:
x1 = 2, x2 = 3, w1 = 0, w2 = 1, b = 4
1. Step 1: Multiply by a weight
- x1 -> 2 \* 0 = 0
- x2 -> 3 \* 1 = 3
2. Step 2: Add the weighted inputs and the bias
- 0 + 3 + 4 = 7
3. Step 3: Activation function
- y = f(7) = 0.999
---
1. Implement the `feedforward` method of the class `Neuron` that takes as input the inputs (x1, x2) and that uses the attributes (the weights and the bias) to return y (a minimal sketch is given at the end of this exercise):
```
class Neuron:
def __init__(self, weight1, weight2, bias):
self.weights_1 = weight1
self.weights_2 = weight2
self.bias = bias
def feedforward(self, x1, x2):
#TODO
return y
```
Note: if you are comfortable with matrix multiplication, feel free to vectorize the operations as done in the article.
https://victorzhou.com/blog/intro-to-neural-networks/
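A minimal sketch of one possible implementation, not the reference solution; it reproduces the example above (x1 = 2, x2 = 3, w1 = 0, w2 = 1, b = 4).
```python
import numpy as np

def sigmoid(x):
    # Activation function seen in W2DAY2 (Logistic Regression).
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weight1, weight2, bias):
        self.weights_1 = weight1
        self.weights_2 = weight2
        self.bias = bias

    def feedforward(self, x1, x2):
        # Weighted sum of the inputs plus the bias, passed through the sigmoid.
        total = x1 * self.weights_1 + x2 * self.weights_2 + self.bias
        return sigmoid(total)

print(Neuron(0, 1, 4).feedforward(2, 3))  # ~0.999
```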
---
---
# Exercise 2: Neural network
The goal of this exercise is to understand how to combine three neurons to form a neural network. A neural network is nothing else than neurons connected together. As shown in the figure, the neural network is composed of **layers**:
- Input layer: it only represents input data. **It doesn't contain neurons**.
- Output layer: it represents the last layer. It contains a neuron (in some cases more than 1).
- Hidden layer: any layer between the input (first) layer and output (last) layer. Many hidden layers can be stacked. When there are many hidden layers, the neural network is deep.
Notice that the neuron **o1** in the output layer takes as input the output of the neurons **h1** and **h2** in the hidden layer.
In exercise 1, you implemented this neuron.
![alt text][neuron]
[neuron]: ./w3_day1_neuron.png "Plot"
Now, we add two more neurons:
- h2, the second neuron of the hidden layer
- o1, the neuron of the output layer
![alt text][nn]
[nn]: ./w3_day1_neural_network.png "Plot"
1. Implement the `feedforward` method of the class `OurNeuralNetwork` that takes as input the input data and returns the output y (a minimal sketch is given after the skeleton below). Return the output for these neurons:
```
neuron_h1 = Neuron(1,2,-1)
neuron_h2 = Neuron(0.5,1,0)
neuron_o1 = Neuron(2,0,1)
```
```
class OurNeuralNetwork:
def __init__(self, neuron_h1, neuron_h2, neuron_o1):
self.h1 = neuron_h1
self.h2 = neuron_h2
self.o1 = neuron_o1
def feedforward(self, x1, x2):
# The inputs for o1 are the outputs from h1 and h2
# TODO
return y
```
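A minimal sketch of how the three neurons could be wired together; the `Neuron` class is the one sketched in exercise 1 and the input (2, 3) is only illustrative.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weight1, weight2, bias):
        self.weights_1, self.weights_2, self.bias = weight1, weight2, bias

    def feedforward(self, x1, x2):
        return sigmoid(x1 * self.weights_1 + x2 * self.weights_2 + self.bias)

class OurNeuralNetwork:
    def __init__(self, neuron_h1, neuron_h2, neuron_o1):
        self.h1, self.h2, self.o1 = neuron_h1, neuron_h2, neuron_o1

    def feedforward(self, x1, x2):
        # The outputs of the hidden neurons are the inputs of the output neuron.
        out_h1 = self.h1.feedforward(x1, x2)
        out_h2 = self.h2.feedforward(x1, x2)
        return self.o1.feedforward(out_h1, out_h2)

network = OurNeuralNetwork(Neuron(1, 2, -1), Neuron(0.5, 1, 0), Neuron(2, 0, 1))
print(network.feedforward(2, 3))  # a single value between 0 and 1
```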
---
---
# Exercise 3: Log loss
The goal of this exercise is to implement the Log loss function. As mentioned last week, this function is used in classification as a **loss function**. It means that the better the classifier is, the smaller the loss function is. In W2D1, you implemented gradient descent on the MSE loss to update the weights of the linear regression. Similarly, the minimization of the Log loss leads to finding optimal weights.
Log loss: - 1/n \* Sum[ y_true \* log(y_pred) + (1 - y_true) \* log(1 - y_pred) ]
1. Create a function `log_loss_custom` and compute the loss for the data below:
```
y_true = np.array([0,1,1,0,1])
y_pred = np.array([0.1,0.8,0.6, 0.5, 0.3])
```
Check that `log_loss` from `sklearn.metrics` returns the same result (a minimal sketch is given after the link below).
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
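A minimal sketch of what such a check could look like:
```python
import numpy as np
from sklearn.metrics import log_loss

def log_loss_custom(y_true, y_pred):
    # -1/n * sum( y * log(p) + (1 - y) * log(1 - p) )
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0.1, 0.8, 0.6, 0.5, 0.3])

print(log_loss_custom(y_true, y_pred))
print(log_loss(y_true, y_pred))  # should match the custom implementation
```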
---
---
# Exercise 4: Forward propagation
The goal of this exercise is to compute the log loss on the output of the forward propagation. The data used is the tiny data set below.
| name | math | chemistry | exam_success |
| :--- | ---: | --------: | -----------: |
| Bob | 12 | 15 | 1 |
| Eli | 10 | 9 | 0 |
| Tom | 18 | 18 | 1 |
| Ryan | 13 | 14 | 1 |
The goal of the network is to predict success at the exam given the math and chemistry grades. The inputs are `math` and `chemistry` and the target is `exam_success`.
1. Compute and return the output of the neural network for each of the students. Here are the weights and biases of the neural network:
```
neuron_h1 = Neuron(0.05, 0.001, 0)
neuron_h2 = Neuron(0.02, 0.003, 0)
neuron_o1 = Neuron(2,0,0)
```
2. Compute the log loss for the 4 students given the output of the neural network.
---
---
# Exercise 5: Regression
The goal of this exercise is to learn to adapt the output layer to regression.
As a reminder, one of the reasons the sigmoid is used in classification is that it squashes the output between 0 and 1, which is the expected output range for a probability (W2D2: Logistic regression). However, the output of a regression is not a probability.
In order to perform a regression using a neural network, the activation function of the neuron on the output layer has to be changed to the **identity function**. In mathematics, the identity function is **f(x) = x**. In other words, it returns the input as is. The three steps become:
1. Each input is multiplied by a weight
- x1 -> x1 \* w1
- x2 -> x2 \* w2
2. The weighted inputs are added together with a bias b
- (x1 \* w1) + (x2 \* w2) + b
3. The sum is passed through an activation function
- y = f((x1 \* w1) + (x2 \* w2) + b)
- The activation function is **the identity**
- y = (x1 \* w1) + (x2 \* w2) + b
The activation function of all other neurons **doesn't change**.
1. Adapt the neuron class implemented in exercise 1. It now takes a boolean parameter `regression`. When its value is `True`, `feedforward` should use the identity function as activation function instead of the sigmoid function.
```
class Neuron:
def __init__(self, weight1, weight2, bias, regression):
self.weights_1 = weight1
self.weights_2 = weight2
self.bias = bias
#TODO
def feedforward(self, x1, x2):
#TODO
return y
```
- Compute the output for:
```
neuron = Neuron(0,1,4, True)
neuron.feedforward(2,3)
```
2. Now, the goal of the network is to predict the physics grade at the exam given the math and chemistry grades. The inputs are `math` and `chemistry` and the target is `physics`.
| name | math | chemistry | physics |
| :--- | ---: | --------: | ------: |
| Bob | 12 | 15 | 16 |
| Eli | 10 | 9 | 10 |
| Tom | 18 | 18 | 19 |
| Ryan | 13 | 14 | 16 |
Compute and return the output of the neural network for each of the students. Here are the weights and biases of the neural network:
```
#replace regression by the right value
neuron_h1 = Neuron(0.05, 0.001, 0, regression)
neuron_h2 = Neuron(0.002, 0.003, 0, regression)
neuron_o1 = Neuron(2,7,10, regression)
```
3. Compute the MSE for the 4 students.

75
subjects/ai/neural-networks/audit/README.md

@ -0,0 +1,75 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Do `import jupyter` and `import numpy` run without any error ?
---
---
#### Exercise 1: The neuron
##### The question 1 is validated if this code:
```
neuron = Neuron(0,1,4)
neuron.feedforward(2,3)
```
returns **0.9990889488055994**.
---
---
#### Exercise 2: Neural network
##### The question 1 is validated if the output is: **0.9524917424084265**
---
---
#### Exercise 3: Log loss
##### The question 1 is validated if the output is: **0.5472899351247816**.
---
---
#### Exercise 4: Forward propagation
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is:
```
Bob: 0.7855253278357536
Eli: 0.7771516558846259
Tom: 0.8067873659804015
Ryan: 0.7892343955586032
```
##### The question 2 is validated if the log loss for the 4 students is **0.5485133607757963**.
---
---
#### Exercise 5: Regression
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is **7**.
##### The question 2 is validated if the outputs are:
```
Bob: 14.918863163724454
Eli: 14.83137890625537
Tom: 15.086662606964074
Ryan: 14.939270885974128
```
##### The question 3 is validated if the MSE is **10.237608699909138**

BIN
subjects/ai/neural-networks/w3_day1_neural_network.png

(binary image added, 131 KiB)

BIN
subjects/ai/neural-networks/w3_day1_neuron.png

(binary image added, 62 KiB)

736
subjects/ai/nlp-scraper/BBC News Test.csv

(file diff suppressed: too large)

1491
subjects/ai/nlp-scraper/BBC News Train.csv

(file diff suppressed: too large)

176
subjects/ai/nlp-scraper/README.md

@ -0,0 +1,176 @@
# NLP-enriched News Intelligence platform
The goal of this project is to build an NLP-enriched News Intelligence platform. News analysis is a trending and important topic. The analysts get their information from the news and the amount of available information is limitless. Having a platform that helps to detect the relevant information is definitely valuable.
The platform connects to a news data source, detects the entities, detects the topic of the article, analyses the sentiment, and more.
### Scrapper
News data source:
- Find a news website that is easy to scrape. I could have chosen the website for you, but news websites change their scraping policy frequently.
- Store it:
  - File system, per day:
    - URL, date, unique id
    - headline
    - body of the article
  - SQL database (optional)
Keep only the news from the last week, otherwise the volume may be too high.
### NLP engine
In production architectures, the NLP engine delivers a live output based on the news that is delivered as a live data stream by the scrapper. However, that requires advanced Python skills that are not a prerequisite for the AI branch.
To simplify this step, the scrapper and the NLP engine are used independently in the project. The scrapper fetches the news and stores them in the data structure (either the file system or the SQL database) and then the NLP engine runs on the stored data.
Here is how the NLP engine should process the news:
### **1. Entities detection:**
The goal is to detect all the entities in the document (headline and body). The type of entity we focus on is `ORG`. This corresponds to companies and organisations. This information should be stored.
- Detect all companies using SpaCy NER on the body of the text.
https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
### **2. Topic detection:**
The goal is to detect what the article is dealing with: Tech, Sport, Business, Entertainment or Politics. To do so, a labelled dataset is provided. From this dataset, build a classifier that learns to detect the right topic in the article. The trained model should be stored as `topic_classifier.pkl`. Make sure the model can be used easily (for instance, packaged together with the preprocessing pipeline) because the audit requires the auditor to test the model.
Save the plot of learning curves (`learning_curves.png`) in `results` to prove that the model is trained correctly and not overfitted.
- Learning constraints: **Score on test: > 95%**
- **Optional**: If you want to train a news topic classifier on a more challenging dataset, you can use the following one, which is based on 200k news headlines: https://www.kaggle.com/rmisra/news-category-dataset.
### **3. Sentiment analysis:**
The goal is to detect the sentiment of the news articles. To do so, use a pre-trained sentiment model. I suggest using NLTK.
There are 3 reasons for which we use a pre-trained model:
1. As a Data Scientist, you should learn to use a pre-trained model. There are so many models available and trained that sometimes you don't need to train one from scratch.
2. Labelled news data for sentiment analysis is very expensive. Companies such as SESAMm provide this kind of service.
3. You already know how to train a sentiment analysis classifier ;-)
### **4. Scandal detection**
The goal is to detect environmental disaster for the detected companies. Here is the methodology that should be used:
- Define keywords that correspond to environmental disasters that may be caused by companies: pollution, deforestation, etc. Here is an example of a disaster we want to detect: https://en.wikipedia.org/wiki/MV_Erika. Pay attention not to use ambiguous words that make sense in the context of an environmental disaster but also in other contexts, as this would lead to detecting false positives (a minimal sketch of this keyword-similarity approach is given after this list).
- Compute the embeddings of the keywords.
- Compute the distance between the embeddings of the keywords and all sentences that contain an entity. Explain in the `README.md` the embeddings chosen and why. Similarly explain the distance or similarity chosen and why.
- Save the distance
- Flag the top 10 articles.
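A minimal sketch of this approach with spaCy and cosine similarity; the model name `en_core_web_md` (a pipeline that ships word vectors), the keyword list and the sentence are assumptions.
```python
import spacy

nlp = spacy.load("en_core_web_md")  # assumed pipeline with word vectors

keywords = ["pollution", "deforestation", "oil spill"]  # example keywords only
keyword_docs = [nlp(k) for k in keywords]

sentence = nlp("The company was blamed for a massive oil spill near the coast.")

# Cosine similarity between the sentence vector and each keyword vector;
# keep the highest score as a proxy for how "scandal-like" this sentence is.
score = max(sentence.similarity(k) for k in keyword_docs)
print(score)
```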
### 5. **Source analysis (optional)**
The goal is to show insights about the news' source you scrapped.
This requires to scrap data on at least 5 days (a week ideally). Save the plots in the `results` folder.
Here are examples of insights:
- Per day:
- Proportion of topics per day
- Number of articles
- Number of companies mentioned
- Sentiment per day
- Per companies:
- Companies mentioned the most
- Sentiment per companies
### Deliverables
The structure of the project is:
```
project
│ README.md
│ environment.yml
└───data
│ │ topic_classification_data.csv
└───results
│ │ topic_classifier.pkl
│ │ learning_curves.png
│ │ enhanced_news.csv
|
|───nlp_engine
```
1. Run the scrapper until it fetches at least 300 articles
```
python scrapper_news.py
1. scrapping <URL>
requesting ...
parsing ...
saved in <path>
2. scrapping <URL>
requesting ...
parsing ...
saved in <path>
```
2. Run the NLP engine on these 300 articles.
Save a DataFrame:
Date scrapped (date)
Title (str)
URL (str)
Body (str)
Org (str)
Topics (list str)
Sentiment (list float or float)
Scandal_distance (float)
Top_10 (bool)
```prompt
python nlp_enriched_news.py
Enriching <URL>:
Cleaning document ... (optional)
---------- Detect entities ----------
Detected <X> companies which are <company_1> and <company_2>
---------- Topic detection ----------
Text preprocessing ...
The topic of the article is: <topic>
---------- Sentiment analysis ----------
Text preprocessing ... (optional)
The title which is <title> is <sentiment>
The body of the article is <sentiment>
---------- Scandal detection ----------
Computing embeddings and distance ...
Environmental scandal detected for <entity>
```
I strongly suggest creating a data structure (a dictionary for example) to save all the intermediate results. Then, a boolean argument `cache` can fetch the intermediate results when they are already computed.
Resources:
- https://www.youtube.com/watch?v=XVv6mJpFOb0

112
subjects/ai/nlp-scraper/audit/README.md

@ -0,0 +1,112 @@
#### NLP-enriched News Intelligence platform
##### Preliminary
```
project
│ README.md
│ environment.yml
└───data
│ │ topic_classification_data.csv
└───results
│ │ topic_classifier.pkl
│ │ learning_curves.png
│ │ enhanced_news.csv
|
|───nlp_engine
```
###### Is the structure of the project as below?
###### Does the readme file give an introduction of the project, show the username, describe the feature engineering and show the best score on the leaderboard?
###### Does the environment contain all libraries used and their versions that are necessary to run the code ?
##### Scrapper
##### There are at least 300 news articles stored in the file system or the database.
##### Run the scrapper with `python scrapper_news.py` and fetch 3 documents. The scrapper is not expected to fetch 3 documents and stop by itself, you can stop it manually. It runs without any error and stores the 3 files as expected.
##### Topic classifier
###### Are the learning curves provided ?
###### Do the learning curves prove the topic classifier is trained correctly, without overfitting?
###### Can you run the topic classifier model on the test set without any error?
###### Does the topic classifier score an accuracy higher than 95% ?
##### Scandal detection
###### Does the `README.md` explain the choice of embeddings and distance ?
###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal ?
###### Is the distance or similarity saved in the DataFrame ?
##### NLP engine output on 300 articles
###### Does the DataFrame contain 300 different rows ?
###### Are the columns of the DataFrame as expected?
```
Date scrapped (date)
Title (str)
URL (str)
Body (str)
Org (str)
Topics (list str)
Sentiment (list float or float)
Scandal_distance (float)
Top_10 (bool)
```
##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate so you should expect a few issues in the results.
##### NLP engine on 3 articles
###### Can you run `python nlp_enriched_news.py` without any error ?
###### Does the output of the nlp engine correspond to the output below?
```prompt
python nlp_enriched_news.py
Enriching <URL>:
Cleaning document ... (optional)
---------- Detect entities ----------
Detected <X> companies which are <company_1> and <company_2>
---------- Topic detection ----------
Text preprocessing ...
The topic of the article is: <topic>
---------- Sentiment analysis ----------
Text preprocessing ... (optional)
The title which is <title> is <sentiment>
The body of the article is <sentiment>
---------- Scandal detection ----------
Computing embeddings and distance ...
Environmental scandal detected for <entity>
```
##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.

255
subjects/ai/numpy/README.md

@ -0,0 +1,255 @@
# NumPy
The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Your first NumPy array
- Exercise 2: Zeros
- Exercise 3: Slicing
- Exercise 4: Random
- Exercise 5: Split, concatenate, reshape arrays
- Exercise 6: Broadcasting and Slicing
- Exercise 7: NaN
- Exercise 8: Wine
- Exercise 9: Football tournament
### Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
_Version of NumPy I used to do the exercises: 1.18.1_.
I suggest using the most recent one.
### Resources
- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow to write and test code within seconds. However, it is really easy to implement unstable and non-reproducible code using notebooks. Keep the notebook and the underlying code clean. One of the articles below details when notebooks should be used. Notebooks can be used for most of the exercises of the piscine as the goal is to experiment A LOT. But no worries, you'll be asked to build a more robust structure for all the projects.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python. However, for educational purposes you will install a specific version of Python in this exercise.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment named `ex00`, with Python `3.8`, with the following libraries: `numpy`, `jupyter`. Save the installed packages in `requirements.txt` in the current directory.
2. Launch a `jupyter notebook` on port `8891` and create a notebook named `Notebook_ex00`. `JupyterLab` can be used instead of Jupyter Notebook here.
3. Put the text `H1 TITLE` as **heading level 1** and `H2 TITLE` as **heading level 2** in the first cell.
4. Run `print("Buy the dip ?")` in the second cell
### Resources:
- https://www.python.org/
- https://docs.conda.io/
- https://jupyter.org/
- https://numpy.org/
- https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330
- https://odsc.medium.com/why-you-should-be-using-jupyter-notebooks-ea2e568c59f2
- https://stackoverflow.com/questions/50777849/from-conda-create-requirements-txt-for-pip3
---
---
# Exercise 1: Your first NumPy array
The goal of this exercise is to use many Python data types in **NumPy** arrays. **NumPy** arrays are intensively used in **NumPy** and **Pandas**. They are flexible and allow using optimized underlying **NumPy** functions.
1. Create a NumPy array that contains: an integer, a float, a string, a dictionary, a list, a tuple, a set and a boolean. Add the following code at the end of your python file or in a cell of the jupyter notebook:
```python
for i in your_np_array:
print(type(i))
```
---
---
# Exercise 2: Zeros
The goal of this exercise is to learn to create a NumPy array with 0s.
1. Create a NumPy array of dimension **300** with zeros without filling it manually
2. Reshape it to **(3,100)**
---
---
# Exercise 3: Slicing
The goal of this exercise is to learn NumPy indexing/slicing. It allows accessing values of a NumPy array efficiently and without a for loop.
1. Create a NumPy array of dimension 1 that contains all integers from 1 to 100 ordered.
2. Without using a for loop and using the array created in Q1, create an array that contains all odd integers. The expected output is: `np.array([1,3,...,99])`. _Hint_: it takes one line
3. Without using a for loop and using the array created in Q1, create an array that contains all even integers reversed. The expected output is: `np.array([100,98,...,2])`. _Hint_: it takes one line
4. Using the array of Q1, set the value of every third element (starting with the second) to 0. The expected output is: `np.array([[1,0,3,4,0,...,0,99,100]])`
---
---
# Exercise 4: Random
The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons:
lack of real data, creating a random benchmark, using varied data sets.
NumPy proposes a lot of options to generate random data. In statistics, assumptions are made on the distribution the data comes from. All data distributions that can be generated randomly are described in the documentation. In this exercise we will focus on two distributions:
- Uniform: For example, if your goal is to generate a random number from 1 to 100 with an equal probability for each number, you'll need the uniform distribution. NumPy provides `randint` and `uniform` to generate uniform distributions.
- Normal: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls**, it can be done using the normal distribution. In that case, we need two parameters: the mean (1m51) and the standard deviation (0.0741m). NumPy provides `randn` to generate normal distributions (among others).
https://numpy.org/doc/stable/reference/random/generator.html
1. Set the seed to 888
2. Generate a **one-dimensional** array of size 100 with a normal distribution
3. Generate a **two-dimensional** array of size 8,8 with random integers from 1 to 10 - both included (same probability for each integer)
4. Generate a **three-dimensional** array of size 4,2,5 with random integers from 1 to 17 - both included (same probability for each integer)
---
---
# Exercise 5: Split, concatenate, reshape arrays
The goal of this exercise is to learn to concatenate and reshape arrays.
1. Generate an array with integers from 1 to 50: `array([1,...,50])`
2. Generate an array with integers from 51 to 100: `array([51,...,100])`
3. Using `np.concatenate`, concatenate the two arrays into: `array([1,...,100])`
4. Reshape the previous array into:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
```
---
---
# Exercise 6: Broadcasting and Slicing
The goal of this exercise is to learn to access values of n-dimensional arrays efficiently.
1. Create a 2-dimensional array of size 9,9 filled with 1s. Each value has to be an `int8`.
2. Using **slicing**, output this array:
```python
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: Computation on Arrays: Broadcasting)
---
---
# Exercise 7: NaN
The goal of this exercise is to learn to deal with missing data in NumPy and to manipulate NumPy arrays.
Let us consider a 2-dimensional array that contains the grades at the past two exams. Some of the students missed the first exam. As the grade is missing it has been replaced with a `NaN`.
1. Using `np.where` create a third column that is equal to the grade of the first exam if it exists and to the grade of the second exam otherwise. Add the column as the third column of the array.
**Using a for loop or if/else statement is not allowed in this exercise.**
```python
import numpy as np
generator = np.random.default_rng(123)
grades = np.round(generator.uniform(low = 0.0, high = 10.0, size = (10, 2)))
grades[[1,2,5,7], [0,0,0,0]] = np.nan
print(grades)
```
---
---
# Exercise 8: Wine
The goal of this exercise is to learn to perform a basic data analysis on real data using NumPy.
The data set that will be used for this exercise is the red wine data set.
https://archive.ics.uci.edu/ml/datasets/wine+quality
How to tell if a given 2D array has null columns?
1. Using `genfromtxt` load the data and reduce the size of the numpy array by optimizing the types. The sum of absolute differences between the original data set and the "memory" optimized one has to be smaller than 1e-3. I suggest using `np.float32`. Check that the numpy array weighs **76800 bytes**.
2. Print 2nd, 7th and 12th rows as a two dimensional array
3. Is there any wine with a percentage of alcohol greater than 20% ? Return True or False
4. What is the average % of alcohol on all wines in the data set ? If needed, drop `np.nan` values
5. Compute the minimum, the maximum, the 25th percentile, the 50th percentile, the 75th percentile, the median (50th percentile) of the pH
6. Compute the average quality of the wines having the 20% least sulphates
7. Compute the mean of all variables for wines having the best quality. Same question for the wines having the worst quality
---
---
# Exercise 9: Football tournament
The goal of this exercise is to learn to use permutations on a more complex, realistic problem.
A football tournament is organized in your city. There are 10 teams and the director of the tournament wants you to create a first round as exciting as possible. To do so, you are allowed to choose the pairs. As a former data scientist, you implemented a model based on the teams' current season performance. This model predicts the score difference between two teams. You used this algorithm to predict the score difference for every possible pair.
The matrix returned is a 2-dimensional array that contains in (i,j) the score difference between team i and j. The matrix is in `model_forecasts.txt`.
Using this output, what are the pairs that will give the most interesting matches ?
If a team wins 7-1 the match is obviously less exciting than a match where the winner wins 2-1.
The criteria that corresponds to **the pairs that will give the most interesting matches** is **the pairs that minimize the sum of squared differences**
The expected output is:
```console
[[m1_t1 m2_t1 m3_t1 m4_t1 m5_t1]
[m1_t2 m2_t2 m3_t2 m4_t2 m5_t2]]
```
- m1_t1 stands for match1_team1
- m1_t1 plays against m1_t2 ...
**Usage of a for loop is not allowed; you may need the** `itertools` **library to create permutations.** A minimal sketch is given after the link below.
https://docs.python.org/3.9/library/itertools.html
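A minimal sketch of a brute-force search, assuming the matrix is loaded from `model_forecasts.txt`; generator expressions replace explicit for loops, and the search over the 10! orderings is slow but simple (it can be reduced with `itertools.combinations`).
```python
from itertools import permutations

import numpy as np

# Entry (i, j) is the predicted score difference between team i and team j.
scores = np.genfromtxt("model_forecasts.txt")

def cost(p):
    # p is an ordering of the 10 teams, paired as (p[0], p[1]), (p[2], p[3]), ...
    return sum(scores[p[2 * k], p[2 * k + 1]] ** 2 for k in range(5))

# Ordering that minimizes the sum of squared score differences.
best = min(permutations(range(10)), key=cost)
print(np.array(best).reshape(5, 2).T)  # row 0: m*_t1, row 1: m*_t2
```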

314
subjects/ai/numpy/audit/README.md

@ -0,0 +1,314 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated
##### Install the virtual environment with `requirements.txt`
##### Activate the virtual environment. If you used `conda`, run `conda activate ex00`
###### Does the shell specify the name `ex00` of the environment on the left ?
##### Run `python --version`
###### Does it print `Python 3.8.x`? x could be any number from 0 to 9
##### Do `import jupyter` and `import numpy` run without any error?
###### Have you used the command `jupyter notebook --port 8891`?
###### Is there a file named `Notebook_ex00.ipynb` in the working directory ?
###### Is the following markdown code executed in a markdown cell in the first cell ?
```
# H1 TITLE
## H2 TITLE
```
###### Does the second cell contain `print("Buy the dip ?")` and return `Buy the dip ?` in the output section ?
---
---
#### Exercise 1: Your first NumPy array
##### Add cell and run `type(your_numpy_array)`
###### Is `your_numpy_array` a NumPy array? It can be checked with `type(your_numpy_array)`, which should be equal to `numpy.ndarray`.
##### Run all the cells of the notebook or `python main.py`
###### Are the types printed as follows?
```
<class 'int'>
<class 'float'>
<class 'str'>
<class 'dict'>
<class 'list'>
<class 'tuple'>
<class 'set'>
<class 'bool'>
```
##### Delete all the cells you added for the audit and restart the notebook
---
---
#### Exercise 2: Zeros
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the solution uses `np.zeros` and if the shape of the array is `(300,)`
##### The question 2 is validated if the solution uses `reshape` and the shape of the array is `(3, 100)`
---
---
#### Exercise 3: Slicing
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100, and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`.
##### The question 2 is validated if the solution is: `integers[::2]`
##### The question 3 is validated if the solution is: `integers[::-2]`
##### The question 4 is validated if the array is: `np.array([0, 1,0,3,4,0,...,0,99,100])`. There are at least two ways to get this result without a for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array:
```python
mask = (integers+1)%3 == 0
integers[mask] = 0
```
---
---
#### Exercise 4: Random
##### The exercise is validated if all questions of the exercise are validated
##### For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
##### The question 1 is validated if the solution is: `np.random.seed(888)`
##### The question 2 is validated if the solution is: `np.random.randn(100)`. The value of the first element is `0.17620087373662233`.
##### The question 3 is validated if the solution is: `np.random.randint(1,11,(8,8))`.
```console
Given the NumPy version and the seed, you should have this output:
array([[ 7, 4, 8, 10, 2, 1, 1, 10],
[ 4, 1, 7, 4, 3, 5, 2, 8],
[ 3, 9, 7, 4, 9, 6, 10, 5],
[ 7, 10, 3, 10, 2, 1, 3, 7],
[ 3, 2, 3, 2, 10, 9, 5, 4],
[ 4, 1, 9, 7, 1, 4, 3, 5],
[ 3, 2, 10, 8, 6, 3, 9, 4],
[ 4, 4, 9, 2, 8, 5, 9, 5]])
```
##### The question 4 is validated if the solution is: `np.random.randint(1,18,(4,2,5))`.
```console
Given the NumPy version and the seed, you should have this output:
array([[[14, 16, 8, 15, 14],
[17, 13, 1, 4, 17]],
[[ 7, 15, 2, 8, 3],
[ 9, 4, 13, 9, 15]],
[[ 5, 11, 11, 14, 10],
[ 2, 1, 15, 3, 3]],
[[ 3, 10, 5, 16, 13],
[17, 12, 9, 7, 16]]])
```
---
---
#### Exercise 5: Split, concatenate, reshape arrays
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 50 is part of the array.
##### The question 2 is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 100 is part of the array.
##### The question 3 is validated if the arrays are concatenated this way: `np.concatenate((array1, array2))`.
##### The question 4 is validated if the result is:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
```
The easiest way is to use `array.reshape(10,10)`.
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of NumPy Arrays)
---
---
#### Exercise 6: Broadcasting and Slicing
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is the same as:
`np.ones([9,9], dtype=np.int8)`
##### The question 2 is validated if the output is
```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
##### The solution of question 2 is not accepted if the values of the array have been changed one by one manually. The usage of a for loop is not allowed either.
Here is an example of a possible solution:
```console
x[1:8,1:8] = 0
x[2:7,2:7] = 1
x[3:6,3:6] = 0
x[4,4] = 1
```
---
---
#### Exercise 7: NaN
##### The exercise is validated if all questions of the exercise are validated
##### This question is validated if, without having used a for loop or having filled the array manually, the output is:
```console
[[ 7. 1. 7.]
[nan 2. 2.]
[nan 8. 8.]
[ 9. 3. 9.]
[ 8. 9. 8.]
[nan 2. 2.]
[ 8. 2. 8.]
[nan 6. 6.]
[ 9. 2. 9.]
[ 8. 5. 8.]]
```
There are two steps in this exercise:
- Create the vector that contains the grade of the first exam if available or the second. This can be done using `np.where`:
```python
np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
```
- Add this vector as third column of the array. Here are two ways:
```python
np.insert(arr = grades, values = new_vector, axis = 1, obj = 2)
np.hstack((grades, new_vector[:, None]))
```
---
---
#### Exercise 8: Wine
##### The exercise is validated if all questions of the exercise are validated
##### This question is validated if the text file has successfully been loaded in a NumPy array with
`genfromtxt('winequality-red.csv', delimiter=',')` and the reduced array weighs **76800 bytes**
##### This question is validated if the output is
```python
array([[ 7.4 , 0.7 , 0. , 1.9 , 0.076 , 11. , 34. ,
0.9978, 3.51 , 0.56 , 9.4 , 5. ],
[ 7.4 , 0.66 , 0. , 1.8 , 0.075 , 13. , 40. ,
0.9978, 3.51 , 0.56 , 9.4 , 5. ],
[ 6.7 , 0.58 , 0.08 , 1.8 , 0.097 , 15. , 65. ,
0.9959, 3.28 , 0.54 , 9.2 , 5. ]])
```
This slicing gives the answer `my_data[[1,6,11],:]`.
##### This question is validated if the answer is False. There are many ways to get the answer: find the maximum or check values greater than 20.
##### This question is validated if the answer is 10.422983114446529.
##### This question is validated if the answer is:
```console
pH stats
25 percentile: 3.21
50 percentile: 3.31
75 percentile: 3.4
mean: 3.3111131957473416
min: 2.74
max: 4.01
```
> *Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`.*
##### This question is validated if the answer is ~`5.2`. The first step is to get the 20th percentile of the column `sulphates`, then create a boolean array that contains `True` when the value is smaller than the 20th percentile, then select these rows with the column `quality` and compute the `mean`.
##### This question is validated if the output for the best wines is:
```python
array([ 8.56666667, 0.42333333, 0.39111111, 2.57777778, 0.06844444,
13.27777778, 33.44444444, 0.99521222, 3.26722222, 0.76777778,
12.09444444, 8. ])
```
##### This question is validated if the output for the bad wines is:
```python
array([ 8.36 , 0.8845 , 0.171 , 2.635 , 0.1225 , 11. ,
24.9 , 0.997464, 3.398 , 0.57 , 9.955 , 3. ])
```
This can be done in three steps: Get the max, create a boolean mask that indicates rows with max quality, use this mask to subset the rows with the best quality and compute the mean on the axis 0.
---
---
#### Exercise 9: Football tournament
##### This exercise is validated if the output is:
```console
[[0 3 1 2 4]
[7 6 8 9 5]]
```

10
subjects/ai/numpy/data/model_forecasts.txt

@ -0,0 +1,10 @@
nan -9.480000000000000426e+00 1.415000000000000036e+01 1.126999999999999957e+01 -5.650000000000000355e+00 3.330000000000000071e+00 1.094999999999999929e+01 -2.149999999999999911e+00 5.339999999999999858e+00 -2.830000000000000071e+00
9.480000000000000426e+00 nan 4.860000000000000320e+00 -8.609999999999999432e+00 7.820000000000000284e+00 -1.128999999999999915e+01 1.324000000000000021e+01 4.919999999999999929e+00 2.859999999999999876e+00 9.039999999999999147e+00
-1.415000000000000036e+01 -1.126999999999999957e+01 nan 1.227999999999999936e+01 -2.410000000000000142e+00 6.040000000000000036e+00 -5.160000000000000142e+00 -3.870000000000000107e+00 -1.281000000000000050e+01 1.790000000000000036e+00
5.650000000000000355e+00 -3.330000000000000071e+00 -1.094999999999999929e+01 nan -1.364000000000000057e+01 0.000000000000000000e+00 2.240000000000000213e+00 -3.609999999999999876e+00 -7.730000000000000426e+00 8.000000000000000167e-02
2.149999999999999911e+00 -5.339999999999999858e+00 2.830000000000000071e+00 -4.860000000000000320e+00 nan -8.800000000000000044e-01 -8.570000000000000284e+00 2.560000000000000053e+00 -7.030000000000000249e+00 -6.330000000000000071e+00
8.609999999999999432e+00 -7.820000000000000284e+00 1.128999999999999915e+01 -1.324000000000000021e+01 -4.919999999999999929e+00 nan -1.296000000000000085e+01 -1.282000000000000028e+01 -1.403999999999999915e+01 1.456000000000000050e+01
-2.859999999999999876e+00 -9.039999999999999147e+00 -1.227999999999999936e+01 2.410000000000000142e+00 -6.040000000000000036e+00 5.160000000000000142e+00 nan -1.091000000000000014e+01 -1.443999999999999950e+01 -1.372000000000000064e+01
3.870000000000000107e+00 1.281000000000000050e+01 -1.790000000000000036e+00 1.364000000000000057e+01 -0.000000000000000000e+00 -2.240000000000000213e+00 3.609999999999999876e+00 nan 1.053999999999999915e+01 -1.417999999999999972e+01
7.730000000000000426e+00 -8.000000000000000167e-02 8.800000000000000044e-01 8.570000000000000284e+00 -2.560000000000000053e+00 7.030000000000000249e+00 6.330000000000000071e+00 1.296000000000000085e+01 nan -1.169999999999999929e+01
1.282000000000000028e+01 1.403999999999999915e+01 -1.456000000000000050e+01 1.091000000000000014e+01 1.443999999999999950e+01 1.372000000000000064e+01 -1.053999999999999915e+01 1.417999999999999972e+01 1.169999999999999929e+01 nan

1600
subjects/ai/numpy/data/winequality-red.csv

File diff suppressed because it is too large.

72
subjects/ai/numpy/data/winequality.names

@ -0,0 +1,72 @@
Citation Request:
This dataset is public available for research. The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
1. Title: Wine Quality
2. Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
3. Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. PH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
these datasets under a regression approach. The support vector machine model achieved the
best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
analysis procedure).
4. Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are munch more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.
5. Number of Instances: red wine - 1599; white wine - 4898.
6. Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
feature selection.
7. Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
8. Missing Attribute Values: None

173
subjects/ai/pandas/README.md

@ -0,0 +1,173 @@
# Pandas
The goal of this day is to understand practical usage of **Pandas**.
As **Pandas** is intensively used in Data Science, other days of the piscine will be dedicated to it.
Not only is the **Pandas** library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection.
**Pandas** is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in **Pandas**. Data in **Pandas** is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even if it is 40 pages long.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Your first DataFrame
- Exercise 2: Electric power consumption
- Exercise 3: E-commerce purchases
- Exercise 4: Handling missing values
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
_Version of Pandas I used to do the exercises: 1.0.1_.
I suggest using the most recent one.
### Resources
- If I had to give you one resource it would be this one:
https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf
It contains ALL you need to know about Pandas.
- Pandas documentation:
- https://pandas.pydata.org/docs/
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`.
---
---
# Exercise 1: Your first DataFrame
The goal of this exercise is to learn to create basic Pandas objects.
1. Create the DataFrame below in two different ways:
- From a NumPy array
- From a Pandas Series
| | color | list | number |
| --: | :---- | :------ | -----: |
| 1 | Blue | [1, 2] | 1.1 |
| 3 | Red | [3, 4] | 2.2 |
| 5 | Pink | [5, 6] | 3.3 |
| 7 | Grey | [7, 8] | 4.4 |
| 9 | Black | [9, 10] | 5.5 |
2. Print the type of every column and the type of the first value of every column.
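For instance, here is a minimal sketch of one of the two required constructions (from a NumPy object array) together with the type checks of question 2:
```python
import numpy as np
import pandas as pd

# Object array so each cell keeps its Python type (str, list, float).
data = np.array([['Blue', [1, 2], 1.1],
                 ['Red', [3, 4], 2.2],
                 ['Pink', [5, 6], 3.3],
                 ['Grey', [7, 8], 4.4],
                 ['Black', [9, 10], 5.5]], dtype=object)
df = pd.DataFrame(data, index=[1, 3, 5, 7, 9], columns=['color', 'list', 'number'])

# Question 2: type of every column and type of the first value of every column.
for col in df.columns:
    print(type(df[col]), type(df[col].iloc[0]))
```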
---
---
# Exercise 2: Electric power consumption
The goal of this exercise is to learn to manipulate real data with Pandas.
The data set used is [**Individual household electric power consumption**](https://assets.01-edu.org/ai-branch/piscine-ai/household_power_consumption.txt)
1. Delete the columns `Time`, `Sub_metering_2` and `Sub_metering_3`
2. Set `Date` as index
3. Create a function that takes as input the DataFrame with the data set and returns a DataFrame with updated types:
```python
def update_types(df):
#TODO
return df
```
4. Use `describe` to have an overview of the data set
5. Delete the rows with missing values
6. Modify `Sub_metering_1` by adding 1 to it and multiplying the total by 0.06. If x is a value of the column, the output is (x+1)\*0.06.
7. Select all the rows for which the `Date` is greater than or equal to 2008-12-27 and `Voltage` is greater than or equal to 242
8. Print the 88888th row.
9. What is the date for which the `Global_active_power` is maximal ?
10. Sort the first three columns by descending order of `Global_active_power` and ascending order of `Voltage`.
11. Compute the daily average of `Global_active_power`.
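As an illustration, here is a sketch of questions 3 and 11 on a tiny synthetic frame; only the pattern matters, the real values come from the data set above:
```python
import pandas as pd

def update_types(df):
    # Question 3: convert every column to a numeric dtype; invalid strings become NaN.
    return df.apply(pd.to_numeric, errors='coerce')

# Tiny synthetic example with the Date index required by question 2.
df = pd.DataFrame({'Global_active_power': ['4.216', '5.360', '?'],
                   'Voltage': ['234.84', '233.63', '233.29']},
                  index=pd.DatetimeIndex(['2006-12-16', '2006-12-16', '2006-12-17'],
                                         name='Date'))
df = update_types(df)

# Question 11: daily average of Global_active_power (the Date index defines the groups).
print(df.groupby(df.index)['Global_active_power'].mean())
```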
---
---
# Exercise 3: E-commerce purchases
The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided since exercise 2 should have given you a nice introduction.
The data set used is **E-commerce purchases**.
Questions:
1. How many rows and columns are there?
2. What is the average Purchase Price?
3. What were the highest and lowest purchase prices?
4. How many people have English `'en'` as their Language of choice on the website?
5. How many people have the job title of `"Lawyer"` ?
6. How many people made the purchase during the `AM` and how many people made the purchase during `PM` ?
7. What are the 5 most common Job Titles?
8. Someone made a purchase that came from Lot: `"90 WT"` , what was the Purchase Price for this transaction?
9. What is the email of the person with the following Credit Card Number: `4926535242672853`
10. How many people have American Express as their Credit Card Provider and made a purchase above `$95` ?
11. How many people have a credit card that expires in `2025`?
12. What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)
---
---
# Exercise 4: Handling missing values
The goal of this exercise is to learn to handle missing values. In the previous exercise we used the first technique: filtering out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small.
This article explains the different types of missing data and how they should be handled.
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
"**It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.**"
- Preliminary: Drop the `flower` column
1. Fill the missing values with a different "strategy" for each column:
`sepal_length` -> `mean`
`sepal_width` -> `median`
`petal_length`, `petal_width` -> `0`
2. Fill the missing values with the median of the associated column using `fillna`.
- Bonus questions:
- Filling the missing values by 0 or the mean of the associated column is common in Data Science. In that case, explain why filling the missing values with 0 or the mean is a bad idea.
- Find a special row ;-) .

230
subjects/ai/pandas/audit/README.md

@ -0,0 +1,230 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy` and `import pandas` run without any error?
---
---
#### Exercise 1: Your first DataFrame
##### The exercise is validated if all questions of the exercise are validated.
##### The solution of question 1 is accepted if the DataFrame created is the same as the "model" DataFrame. Check that the index is not 1,2,3,4,5.
##### The solution of question 2 is accepted if the columns' types are as below and if the types of the first value of the columns are as below:
```console
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
```
```console
<class 'str'>
<class 'list'>
<class 'float'>
```
---
---
#### Exercise 2: Electric power consumption
##### The exercise is validated if all questions of the exercise are validated
##### The solution of question 1 is accepted if `drop` is used with `axis=1`. `inplace=True` may be useful to avoid having to assign the result to a variable. A solution that could also be accepted (even if it's not the one I recommend) is `del`.
##### The solution of question 2 is accepted if the DataFrame returns the output below. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted. I recommend using `set_index` with `inplace=True` to do so.
```python
Input: df.head().index
Output:
DatetimeIndex(['2006-12-16', '2006-12-16','2006-12-16', '2006-12-16','2006-12-16'],
dtype='datetime64[ns]', name='Date', freq=None)
```
##### The solution of question 3 is accepted if all the types are `float64` as below. The preferred solution is `pd.to_numeric` with `errors='coerce'`.
```python
Input: df.dtypes
Output:
Global_active_power float64
Global_reactive_power float64
Voltage float64
Global_intensity float64
Sub_metering_1 float64
dtype: object
```
##### The solution of question 4 is accepted if you use `df.describe()`.
##### The solution of question 5 is accepted if `dropna` is used and if the number of missing values is equal to 0. It is important to notice that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` allows you to check the number of missing values and `df.dropna()` with `inplace=True` allows you to remove the rows with missing values.
##### The solution of question 6 is accepted if one of the two approaches below were used:
```python
#solution 1
df.loc[:,'A'] = (df['A'] + 1) * 0.06
#solution 2
df.loc[:,'A'] = df.loc[:,'A'].apply(lambda x: (x+1)*0.06)
```
You may wonder why `df.loc[:,'A']` is required and whether `df['A'] = ...` works too. **The answer is no**. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you would be assigning a value to a **copy** of the DataFrame and not to the DataFrame itself.
More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
##### The solution of question 7 is accepted as long as the output of `print(filtered_df.head().to_markdown())` is as below and if the number of rows is equal to **449667**.
| Date | Global_active_power | Global_reactive_power |
|:--------------------|----------------------:|------------------------:|
| 2008-12-27 00:00:00 | 0.996 | 0.066 |
| 2008-12-27 00:00:00 | 1.076 | 0.162 |
| 2008-12-27 00:00:00 | 1.064 | 0.172 |
| 2008-12-27 00:00:00 | 1.07 | 0.174 |
| 2008-12-27 00:00:00 | 0.804 | 0.184 |
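A minimal sketch of that filter, assuming `df` is the exercise DataFrame with the `Date` index from question 2:
```python
# Keep the rows from 2008-12-27 onwards whose Voltage is at least 242.
filtered_df = df[(df.index >= '2008-12-27') & (df['Voltage'] >= 242)]
print(filtered_df.head().to_markdown())
```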
##### The solution of question 8 is accepted if the output is:
```console
Global_active_power 0.254
Global_reactive_power 0.000
Voltage 238.350
Global_intensity 1.200
Sub_metering_1 0.000
Name: 2007-02-16 00:00:00, dtype: float64
```
##### The solution of question 9 is accepted if the output is `Timestamp('2009-02-22 00:00:00')`.
##### The solution of question 10 is accepted if the output of `print(sorted_df.tail().to_markdown())` is:
| Date | Global_active_power | Global_reactive_power | Voltage |
|:--------------------|----------------------:|------------------------:|----------:|
| 2008-08-28 00:00:00 | 0.076 | 0 | 234.88 |
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.18 |
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.4 |
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.64 |
| 2008-12-08 00:00:00 | 0.076 | 0 | 236.5 |
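A minimal sketch of that sort on the first three columns, assuming the same `df` as above:
```python
# Descending Global_active_power, ascending Voltage.
sorted_df = (df[['Global_active_power', 'Global_reactive_power', 'Voltage']]
             .sort_values(by=['Global_active_power', 'Voltage'], ascending=[False, True]))
print(sorted_df.tail().to_markdown())
```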
##### The solution of question 11 is accepted if the output is as below. The solution is based on `groupby` which creates groups based on the index `Date` and aggregates the groups using the `mean`.
```console
Date
2006-12-16 3.053475
2006-12-17 2.354486
2006-12-18 1.530435
2006-12-19 1.157079
2006-12-20 1.545658
...
2010-12-07 0.770538
2010-12-08 0.367846
2010-12-09 1.119508
2010-12-10 1.097008
2010-12-11 1.275571
Name: Global_active_power, Length: 1433, dtype: float64
```
---
---
#### Exercise 3: E-commerce purchases
##### The exercise is validated if all questions of the exercise are validated.
##### To validate this exercise all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas.
##### The solution of question 1 is accepted if it contains **10000 entries** and **14 columns**. There are many solutions based on `shape`, `info` or `describe`.
##### The solution of question 2 is accepted if the answer is **50.34730200000025**.
Even if `np.mean` gives the solution, `df['Purchase Price'].mean()` is preferred
##### The solution of question 3 is accepted if the min is `0` and the max is `99.989999999999995`
##### The solution of question 4 is accepted if the answer is **1098**
##### The solution of question 5 is accepted if the answer is **30**
##### The solution of question 6 is accepted if there are `4932` people who made the purchase during the `AM` and `5068` people who made the purchase during the `PM`. There are many ways to get the solution, but the goal of this question was to make you use `value_counts`
##### The solution of question 7 is accepted if the answer is as below. There are many ways to get the solution, but the goal of this question was to use `value_counts`
Interior and spatial designer 31
Lawyer 30
Social researcher 28
Purchasing manager 27
Designer, jewellery 27
##### The solution of question 8 is accepted if the purchase price is **75.1**
##### The solution of question 9 is accepted if the email address is **bondellen@williams-garza.com**
##### The solution of question 10 is accepted if the answer is **39**. The preferred solution is based on this: `df[(df['A'] == X) & (df['B'] > Y)]`
##### The solution of question 11 is accepted if the answer is **1033**. The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the expiration date.
##### The solution of question 12 is accepted if the answer is as below. The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
- hotmail.com 1638
- yahoo.com 1616
- gmail.com 1605
- smith.com 42
- williams.com 37
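One possible sketch of the preferred approaches for questions 10 to 12; the file path and the column names `CC Provider`, `CC Exp Date` and `Email`, as well as the `MM/YY` expiration format, are assumptions about the data set:
```python
import pandas as pd

df = pd.read_csv('Ecommerce_purchases.txt')  # assumed path and CSV-like format

# Question 10: American Express purchases above 95$.
print(len(df[(df['CC Provider'] == 'American Express') & (df['Purchase Price'] > 95)]))

# Question 11: cards expiring in 2025, slicing the 'MM/YY' expiration string.
print(df['CC Exp Date'].apply(lambda exp: exp[-2:] == '25').sum())

# Question 12: top 5 email providers/hosts.
print(df['Email'].apply(lambda email: email.split('@')[1]).value_counts().head(5))
```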
---
---
#### Exercise 4: Handling missing values
##### The exercise is validated if all questions of the exercise are validated (except the bonus question)
##### The solution of question 1 is accepted if the two steps are implemented in that order. First, convert the numerical columns to `float` and then fill the missing values. The first step may involve `pd.to_numeric(df.loc[:,col], errors='coerce')`. The second step is validated if you eliminated all missing values. However, there are many possibilities to fill the missing values. Here is one of them:
```python
df.fillna({'sepal_length': df.sepal_length.mean(),
           'sepal_width': df.sepal_width.median(),
           'petal_length': 0,
           'petal_width': 0})
```
##### The solution of question 2 is accepted if the solution is `df.loc[:,col].fillna(df[col].median())`.
##### The solution of the bonus question is accepted if you come to this conclusion: once we filled the missing values as suggested in the first question, `df.describe()` returns the interesting summary below. We notice that the mean is way higher than the median, which means that there may be some outliers in the data. The 75% quantile and the max confirm that: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet, you realise this small flower can't be that big. The outliers have a major impact on the mean, which equals 56.9. Using this value for the missing values is not correct since it doesn't correspond to the real size of this flower. That is why, in that case, the best strategy to fill the missing values is the median. The truth is that I modified the data set! But real data sets ALWAYS contain outliers. Always think about the meaning of the data transformation! If you fill the missing values with zero, it means you consider that the length or width of some flowers may be 0, which doesn't make sense.
| | sepal_length | sepal_width | petal_length | petal_width |
| :---- | -----------: | ----------: | -----------: | ----------: |
| count | 146 | 141 | 120 | 147 |
| mean | 56.9075 | 52.6255 | 15.5292 | 12.0265 |
| std | 572.222 | 417.127 | 127.46 | 131.873 |
| min | -4.4 | -3.6 | -4.8 | -2.5 |
| 25% | 5.1 | 2.8 | 2.725 | 0.3 |
| 50% | 5.75 | 3 | 4.5 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 6900 | 3809 | 1400 | 1600 |
##### The solution of the bonus question is accepted if the presence of negative values and huge values has been detected. A good data scientist always checks for abnormal values in the dataset. **YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA**. Print the row with index 122 ;-) This week, we will have the opportunity to focus on data pre-processing to understand how the outliers can be handled.

20001
subjects/ai/pandas/data/Ecommerce_purchases.txt

File diff suppressed because it is too large.

151
subjects/ai/pandas/data/iris.csv

@ -0,0 +1,151 @@
,sepal_length,sepal_width,petal_length,petal_width, flower
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,-3.6,-1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,-4.4,2.9,1400.0,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
10,5.4,3.7,,0.2,Iris-setosa
11,4.8,3.4,,0.2,Iris-setosa
12,4.8,3.0,,0.1,Iris-setosa
13,4.3,3.0,,0.1,Iris-setosa
14,5.8,4.0,,0.2,Iris-setosa
15,5.7,4.4,,0.4,Iris-setosa
16,5.4,3.9,,0.4,Iris-setosa
17,5.1,3.5,,0.3,Iris-setosa
18,5.7,3.8,,0.3,Iris-setosa
19,5.1,3.8,,0.3,Iris-setosa
20,5.4,3.4,,0.2,Iris-setosa
21,5.1,3.7,,0.4,Iris-setosa
22,4.6,3.6,,0.2,Iris-setosa
23,5.1,3.3,,0.5,Iris-setosa
24,4.8,3.4,,0.2,Iris-setosa
25,5.0,-3.0,,0.2,Iris-setosa
26,5.0,3.4,,0.4,Iris-setosa
27,5.2,3.5,,0.2,Iris-setosa
28,5.2,3.4,,0.2,Iris-setosa
29,4.7,3.2,,0.2,Iris-setosa
30,4.8,3.1,1.6,0.2,Iris-setosa
31,5.4,3.4,1.5,0.4,Iris-setosa
32,5.2,4.1,1.5,0.1,Iris-setosa
33,5.5,4.2,1.4,0.2,Iris-setosa
34,4.9,3.1,1.5,0.1,Iris-setosa
35,5.0,3.2,1.2,0.2,Iris-setosa
36,5.5,3.5,1.3,0.2,Iris-setosa
37,4.9,,1.5,0.1,Iris-setosa
38,4.4,3.0,1.3,0.2,Iris-setosa
39,5.1,3.4,1.5,0.2,Iris-setosa
40,5.0,3.5,1.3,0.3,Iris-setosa
41,4.5,2.3,1.3,0.3,Iris-setosa
42,4.4,3.2,1.3,0.2,Iris-setosa
43,5.0,3.5,1.6,0.6,Iris-setosa
44,5.1,3.8,1.9,0.4,Iris-setosa
45,4.8,3.0,1.4,0.3,Iris-setosa
46,5.1,3809.0,1.6,0.2,Iris-setosa
47,4.6,3.2,1.4,0.2,Iris-setosa
48,5.3,3.7,1.5,0.2,Iris-setosa
49,5.0,3.3,1.4,0.2,Iris-setosa
50,7.0,3.2,4.7,1.4,Iris-versicolor
51,6.4,3200.0,4.5,1.5,Iris-versicolor
52,6.9,3.1,4.9,1.5,Iris-versicolor
53,5.5,2.3,4.0,1.3,Iris-versicolor
54,6.5,2.8,4.6,1.5,Iris-versicolor
55,5.7,2.8,4.5,1.3,Iris-versicolor
56,6.3,3.3,4.7,1600.0,Iris-versicolor
57,4.9,2.4,3.3,1.0,Iris-versicolor
58,6.6,2.9,4.6,1.3,Iris-versicolor
59,5.2,2.7,3.9,,Iris-versicolor
60,5.0,2.0,3.5,1.0,Iris-versicolor
61,5.9,3.0,4.2,1.5,Iris-versicolor
62,6.0,2.2,4.0,1.0,Iris-versicolor
63,6.1,2.9,4.7,1.4,Iris-versicolor
64,5.6,2.9,3.6,1.3,Iris-versicolor
65,6.7,3.1,4.4,1.4,Iris-versicolor
66,5.6,3.0,4.5,1.5,Iris-versicolor
67,5.8,2.7,4.1,1.0,Iris-versicolor
68,6.2,2.2,4.5,1.5,Iris-versicolor
69,5.6,2.5,3.9,1.1,Iris-versicolor
70,5.9,3.2,4.8,1.8,Iris-versicolor
71,6.1,2.8,4.0,1.3,Iris-versicolor
72,6.3,2.5,4.9,1.5,Iris-versicolor
73,6.1,2.8,4.7,1.2,Iris-versicolor
74,6.4,2.9,4.3,1.3,Iris-versicolor
75,6.6,3.0,4.4,1.4,Iris-versicolor
76,6.8,2.8,4.8,1.4,Iris-versicolor
77,6.7,3.0,5.0,1.7,Iris-versicolor
78,6.0,2.9,4.5,1.5,Iris-versicolor
79,5.7,2.6,3.5,1.0,Iris-versicolor
80,5.5,2.4,3.8,1.1,Iris-versicolor
81,5.5,2.4,3.7,1.0,Iris-versicolor
82,5.8,2.7,3.9,1.2,Iris-versicolor
83,6.0,2.7,5.1,1.6,Iris-versicolor
84,5.4,3.0,4.5,1.5,Iris-versicolor
85,6.0,3.4,4.5,1.6,Iris-versicolor
86,6.7,3.1,4.7,1.5,Iris-versicolor
87,6.3,2.3,4.4,1.3,Iris-versicolor
88,5.6,3.0,4.1,1.3,Iris-versicolor
89,5.5,2.5,4.0,1.3,Iris-versicolor
90,5.5,2.6,4.4,1.2,Iris-versicolor
91,6.1,3.0,4.6,1.4,Iris-versicolor
92,5.8,2.6,4.0,1.2,Iris-versicolor
93,5.0,2.3,3.3,1.0,Iris-versicolor
94,5.6,2.7,4.2,1.3,Iris-versicolor
95,5.7,3.0,4.2,1.2,Iris-versicolor
96,5.7,2.9,4.2,1.3,Iris-versicolor
97,6.2,2.9,4.3,1.3,Iris-versicolor
98,5.1,2.5,3.0,1.1,Iris-versicolor
99,5.7,2.8,,1.3,Iris-versicolor
100,,3.3,,2.5,Iris-virginica
101,5.8,2.7,,1.9,Iris-virginica
102,7.1,3.0,,2.1,Iris-virginica
103,6.3,2.9,,1.8,Iris-virginica
104,6.5,3.0,,2.2,Iris-virginica
105,7.6,3.0,6.6,2.1,Iris-virginica
106,4.9,2.5,4.5,1.7,Iris-virginica
107,7.3,2.9,6.3,1.8,Iris-virginica
108,6.7,2.5,5.8,1.8,Iris-virginica
109,7.2,3.6,6.1,2.5,Iris-virginica
110,6.5,3.2,5.1,2.0,Iris-virginica
111,6.4,2.7,5.3,1.9,Iris-virginica
112,6.8,3.0,5.5,2.1,Iris-virginica
113,5.7,2.5,5.0,2.0,Iris-virginica
114,5.8,,5.1,2.4,Iris-virginica
115,6.4,,5.3,2.3,Iris-virginica
116,6.5,,5.5,1.8,Iris-virginica
117,7.7,,6.7,2.2,Iris-virginica
118,7.7,,,2.3,Iris-virginica
119,6.0,,5.0,1.5,Iris-virginica
120,6.9,,5.7,2.3,Iris-virginica
121,5.6,2.8,4.9,2.0,Iris-virginica
122,always,check,the,data,!!!!!!!!
123,6.3,2.7,4.9,1.8,Iris-virginica
124,6.7,3.3,5.7,2.1,Iris-virginica
125,7.2,3.2,6.0,1.8,Iris-virginica
126,6.2,2.8,-4.8,1.8,Iris-virginica
127,,3.0,4.9,1.8,Iris-virginica
128,6.4,2.8,5.6,2.1,Iris-virginica
129,7.2,3.0,5.8,1.6,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
131,7.9,3.8,6.4,2.0,Iris-virginica
132,6.-4,2.8,5.6,2.2,Iris-virginica
133,6.3,2.8,,1.5,Iris-virginica
134,6.1,2.6,5.6,1.4,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica
136,6.3,3.4,5.6,2.4,Iris-virginica
137,6.4,3.1,5.5,1.8,Iris-virginica
138,6.0,3.0,4.8,1.8,Iris-virginica
139,6900,3.1,5.4,2.1,Iris-virginica
140,6.7,3.1,,2.4,Iris-virginica
141,6.9,3.1,5.1,2.3,Iris-virginica
142,580,2.7,5.1,,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,-2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica

152
subjects/ai/pandas/data/iris.data

@ -0,0 +1,152 @@
sepal_length,sepal_width,petal_length,petal_width, flower
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,-3.6,-1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
-4.4,2.9,1400,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1500,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,-1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,-3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,"3.5",1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3809,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3200,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1600,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,-4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.-4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,"5.1",1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6900,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
580,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,-2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica

212
subjects/ai/sp500-strategies/README.md

@ -0,0 +1,212 @@
# Financial strategies on the SP500
In this project we will apply machine learning to finance. You are a Quant/Data Scientist and your goal is to create a financial strategy based on a signal outputted by a machine learning model that outperforms the [SP500](https://en.wikipedia.org/wiki/S%26P_500).
The Standard & Poors 500 Index is a collection of stocks intended to reflect the overall return characteristics of the stock market as a whole. The stocks that make up the S&P 500 are selected by market capitalization, liquidity, and industry. Companies to be included in the S&P are selected by the S&P 500 Index Committee, which consists of a group of analysts employed by Standard & Poor's.
The S&P 500 Index originally began in 1926 as the "composite index", composed of only 90 stocks. According to historical records, the average annual return since its inception in 1926 through 2018 is approximately 10%–11%. The average annual return since adopting 500 stocks into the index in 1957 through 2018 is roughly 8%.
As a Quant Researcher, you may beat the SP500 one year or a few years. The real challenge though is to beat the SP500 consistently over decades. That's what most hedge funds in the world are trying to do.
The project is divided into three parts:
- **Data processing and feature engineering**: Build a dataset: insightful features and the target
- **Machine Learning pipeline**: Train machine learning models on the dataset, select the best model and generate the machine learning signal.
- **Strategy backtesting**: Generate a strategy from the machine learning model output and backtest the strategy. As a reminder, the idea here is to see how the strategy would have performed if you had invested.
### Deliverables
Do not forget to check the resources of W1D5 and especially W1D5E4.
### Data processing and feature engineering
The first file contains SP500 index data (OHLC: 4 time-series) and the other file contains the OHLCV data on the SP500 constituents.
- Split the data into train and test sets. The test set should start from **2017**.
- Your first priority is to build a dataset without leakage !!! NO LEAKAGE !!!
Note: Financial data can be complex and tricky to analyse for a lot of reasons. In order to focus on Time Series forecasting, the project gives access to a "simplified" financial dataset. For instance, we consider the composition of the SP500 remains similar over time which is not true and which introduces a "survivor bias". Plus, the data during covid-19 was removed because it may have a significant impact on the backtesting.
**"No leakage" [intro](<https://en.wikipedia.org/wiki/Leakage_(machine_learning)>) and small guide:**
We assume it is day D and we want to take a position over the next h days, starting on the next day. The position starts on day D+1 (included). To decide whether we take a short or long position, the return between day D+1 and D+2 is computed and used as a target. Finally, as the features on day D contain information up to day D 11:59pm, the target needs to be shifted. As a result, the final DataFrame schema is:
| Index | Features | Target |
| ------- | :------------------------: | ---------------: |
| Day D-1 | Features until D-1 23:59pm | return(D, D+1) |
| Day D | Features until D 23:59pm | return(D+1, D+2) |
| Day D+1 | Features until D+1 23:59pm | return(D+2, D+3) |
**Note: This table is simplified, the index of your DataFrame is a multi-index with date and ticker.**
- Features: Bollinger, RSI, MACD
**Note: you can use any library to compute these features, you don't need to implement all financial features from scratch.**
- Target:
- On day D, the target is: **sign(return(D+1, D+2))**
> Remark: The target used is the return computed on the price and not the price directly. There are statistical reasons for this choice - the price is not stationary. The consequence is that a machine learning model tends to overfit while training on non-stationary data.
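As a sketch of a leakage-free target, under the assumptions that your DataFrame uses a `(Date, Ticker)` multi-index, a `Close` column and date-sorted rows per ticker (names are illustrative, not imposed):
```python
import numpy as np
import pandas as pd

def add_target(prices: pd.DataFrame) -> pd.DataFrame:
    # return(t-1, t) for each ticker, indexed by t.
    ret = prices.groupby(level='Ticker')['Close'].pct_change()
    # On day D the target is sign(return(D+1, D+2)), i.e. the return observed at D+2.
    prices['target'] = np.sign(ret.groupby(level='Ticker').shift(-2))
    return prices
```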
### Machine learning pipeline
- Cross-validation deliverables:
- Implement a cross-validation with at least 10 folds. The train set has to cover more than 2 years of history.
- Two types of temporal cross-validations are required:
- Blocking (plot below)
- Time Series split (plot below)
- Make sure the last fold of the train set does not overlap with the test set.
- Make sure the folds do not contain data from the same day. The data should be split on the dates.
- Plot your cross-validation as follows:
![alt text][blocking]
[blocking]: blocking_time_series_split.png "Blocking Time Series split"
![alt text][timeseries]
[timeseries]: Time_series_split.png "Time Series split"
Once you have run the grid search on the cross-validation (choose either Blocking or Time Series split), you'll select the best pipeline on the train set and save it as `selected_model.pkl` and `selected_model.txt` (pipeline hyper-parameters).
**Note: You may observe that the selected model is not good after analyzing the ML metrics (ON THE TRAIN SET) and select another one.**
- ML metrics and feature importances on the selected pipeline on the train set only.
- DataFrame with the machine learning metrics on the train and validation sets for all folds of the train set. Suggested format: columns: ML metrics (AUC, Accuracy, LogLoss), rows: folds, train set and validation set (double index). Save it as `ml_metrics_train.csv`
- Plot. Choose the metric you want. Suggested: AUC. Save it as `metric_train.png`. The plot below shows what the plot should look like.
- DataFrame with top 10 important features for each fold. Save it as `top_10_feature_importance.csv`
![alt text][barplot]
[barplot]: metric_plot.png "Metric plot"
- The signal has to be generated with the chosen cross validation: train the model on the train set of the first fold, then predict on its validation set; train the model on the train set of the second fold, then predict on its validation set, etc ... Then, concatenate the predictions on the validation sets to build the machine learning signal. **The pipeline shouldn't be trained once and predict on all data points !**
**The output is a DataFrame or Series with an ordered double index, containing the probability that the stock price of asset i increases between d+1 and d+2.**
- (optional): [Train a RNN/LSTM](https://towardsdatascience.com/predicting-stock-price-with-lstm-13af86a74944). This is a nice way to discover and learn about recurrent neural networks. But keep in mind that there are some new neural network architectures that seem to outperform recurrent neural networks: https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0.
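One possible sketch of that fold-by-fold signal generation; `pipeline` is assumed to be a scikit-learn classifier pipeline exposing `predict_proba`, and `cv` any splitter yielding positional train/validation indices:
```python
import pandas as pd

def make_ml_signal(pipeline, X, y, cv):
    parts = []
    for train_idx, val_idx in cv.split(X):
        # Fit on the train fold only, predict on its validation fold.
        pipeline.fit(X.iloc[train_idx], y.iloc[train_idx])
        proba = pipeline.predict_proba(X.iloc[val_idx])[:, 1]
        parts.append(pd.Series(proba, index=X.index[val_idx]))
    # The signal is only defined on the validation folds, concatenated in order.
    return pd.concat(parts).sort_index()
```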
### Strategy backtesting
- Backtesting module deliverables. The module takes as input a machine learning signal and converts it into a financial strategy. A financial strategy DataFrame gives the amount invested at time t on asset i. The module returns the following metrics on the train set and the test set.
- PnL plot: save it as `strategy.png`
- x axis: date
- y axis1: PnL of the strategy at time t
- y axis2: PnL of the SP500 at time t
- Use the same scale for y axis1 and y axis2
- add a line that shows the separation between train set and test set
- PnL
- Max drawdown. https://www.investopedia.com/terms/d/drawdown.asp
- (Optional): add other metrics such as Sharpe ratio, volatility, etc.
- Create a markdown report that explains and save it as `report.md`:
- the features used
- the pipeline used
- imputer
- scaler
- dimension reduction
- model
- the cross-validation used
- length of train sets and validation sets
- cross-validation plot (optional)
- strategy chosen
- description
- PnL plot
- strategy metrics on the train set and test set
### Example of strategies:
- Long only:
- Binary signal:
0: do nothing for one day on asset i
1: take a long position on asset i for 1 day
- Weights proportional to the machine learning signals
- invest x on asset i for one day
- Long and short: For those who search for long-short strategies on Google, don't get it wrong, this has nothing to do with pair trading.
- Binary signal:
- -1: take a short position on asset i for 1 day
- 1: take a long position on asset i for 1 day
- Ternary signal:
- -1: take a short position on asset i for 1 day
- 0: do nothing for one day on asset i
- 1: take a long position on asset i for 1 day
Notes:
- Warning! When you don't invest in all stocks, as in the binary signal or the ternary signal, make sure that you are still investing 1$ per day!
- In order to simplify the **short position** we consider that it is the opposite of a long position. Example: I short one AAPL stock and the price decreases by 20$ in one day. I earn 20$.
- Stock picking: Take a long position on the k best assets (from the machine learning signal) and short the k worst assets regarding the machine learning signal.
Here's an example on how to convert a machine learning signal into a financial strategy:
- Input:
| Date | Ticker | Machine Learning signal |
| ------- | :----: | ----------------------: |
| Day D-1 | AAPL | 0.55 |
| Day D-1 | C | 0.36 |
| Day D | AAPL | 0.59 |
| Day D | C | 0.33 |
| Day D+1 | AAPL | 0.61 |
| Day D+1 | C | 0.33 |
- Convert it into a binary long only strategy:
- Machine learning signal > 0.5
| Date | Ticker | Binary signal |
| ------- | :----: | ------------: |
| Day D-1 | AAPL | 1 |
| Day D-1 | C | 0 |
| Day D | AAPL | 1 |
| Day D | C | 0 |
| Day D+1 | AAPL | 1 |
| Day D+1 | C | 0 |
!!! BE CAREFUL !!! THIS IS EXTREMELY IMPORTANT.
- Multiply it with the associated return.
Don't forget the meaning of the signal on day d: it predicts the return between d+1 and d+2. You should multiply the binary signal of day d by the return computed between d+1 and d+2. Otherwise it's wrong, because you would be using a signal that carries information about d+1 and d+2 on the past or present. The strategy would be leaked!
**Assumption**: you have 1$ per day to invest in your strategy.
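One possible sketch of the binary long-only conversion and the PnL it implies; `ml_signal` and `future_return` are assumed to be Series sharing the same `(Date, Ticker)` index, with `future_return` at date d equal to return(d+1, d+2):
```python
# Long only: take a position when the predicted probability exceeds 0.5.
binary_signal = (ml_signal > 0.5).astype(int)

# Invest 1$ per day in total: split it equally between the assets selected that day.
n_selected = binary_signal.groupby(level='Date').transform('sum')
strategy = binary_signal / n_selected.replace(0, 1)

# Daily PnL: amount invested on each asset times the return it earns between d+1 and d+2.
pnl = (strategy * future_return).groupby(level='Date').sum()
```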
### Project repository structure:
```
project
│ README.md
│ environment.yml
└───data
│ │ sp500.csv
└───results
│ │
| |───cross-validation
│ │ │ ml_metrics_train.csv
│ │ │ metric_train.csv
│ │ │ top_10_feature_importance.csv
│ │ │ metric_train.png
│ │
| |───selected model
│ │ │ selected_model.pkl
│ │ │ selected_model.txt
│ │ │ ml_signal.csv
│ │
| |───strategy
| | | strategy.png
│ │ │ results.csv
│ │ │ report.md
|
|───scripts (free format)
│ │ features_engineering.py
│ │ gridsearch.py
│ │ model_selection.py
│ │ create_signal.py
│ │ strategy.py
```
Note: `features_engineering.py` can be used in `gridsearch.py`
### Files for this project
You can find the data required for this project at [this link](https://assets.01-edu.org/ai-branch/project4/project04-20221031T173034Z-001.zip).

BIN
subjects/ai/sp500-strategies/Time_series_split.png


142
subjects/ai/sp500-strategies/audit/README.md

@ -0,0 +1,142 @@
#### Financial strategies on the SP500
This document is the correction of project 4. Some steps are detailed in W1D5E4.
```
project
│ README.md
│ environment.yml
└───data
│ │ sp500.csv
└───results
│ │
| |───cross-validation
│ │ │ ml_metrics_train.csv
│ │ │ metric_train.csv
│ │ │ top_10_feature_importance.csv
│ │ │ metric_train.png
│ │
| |───selected model
│ │ │ selected_model.pkl
│ │ │ selected_model.txt
│ │ │ ml_signal.csv
│ │
| |───strategy
| | | strategy.png
│ │ │ results.csv
│ │ │ report.md
|
|───scripts (free format)
│ │ features_engineering.py
│ │ gridsearch.py
│ │ model_selection.py
│ │ create_signal.py
│ │ strategy.py
```
###### Is the structure of the project as shown above ?
###### Does the readme file summarize how to run the code and explain the global approach ?
###### Does the environment contain all libraries used and their versions that are necessary to run the code ?
###### Do the text files explain the chosen model methodology ?
##### **Data processing and feature engineering**
###### Is the data splitted in a train set and test set ?
###### Is the last day of the train set D and the first day of the test set D+n, with n>0 ? Splitting without considering the time series structure is wrong.
##### There is no leakage: unfortunately there's no automated way to check if the dataset is leaked. This step is validated if the features of date d are built as follows:
| Index | Features | Target |
| ------- | :------------------------: | ---------------: |
| Day D-1 | Features until D-1 23:59pm | return(D, D+1) |
| Day D | Features until D 23:59pm | return(D+1, D+2) |
| Day D+1 | Features until D+1 23:59pm | return(D+2, D+3) |
###### Has the data been grouped by ticker before computing the features ?
###### Has the target been grouped by ticker before computing the future returns ?
##### **Machine Learning pipeline**
##### Cross-Validation
###### Does the CV contain at least 10 folds in total ?
###### Do all train folds have more than 2y history ? If you use time series split, checking that the first fold has more than 2y history is enough.
##### The last validation set of the train set doesn't overlap with the test set.
##### None of the folds contain data from the same day. The split should be done on the dates.
##### There's a plot showing your cross-validation. As usual, all plots should have named axes and a title. If you chose a Time Series Split the plot should look like this:
![alt text][timeseries]
[timeseries]: ../Time_series_split.png "Time Series split"
##### Model Selection
##### The test set hasn't been used to train the model and select the model.
###### Is the selected model saved in the pkl file and described in a txt file ?
##### Selected model
##### The ML metrics computed on the train set are aggregated: sum or median.
###### Are the ml metrics saved in a csv file ?
###### Are the top 10 important features per fold saved in `top_10_feature_importance.csv` ?
###### Does `metric_train.png` show a plot similar to the one below ?
_Note that this can also be done on the test set **IF** it hasn't helped to select the pipeline._
![alt text][barplot]
[barplot]: ../metric_plot.png "Metric plot"
##### Machine learning signal
##### **The pipeline shouldn't be trained once and predict on all data points !** As explained: The signal has to be generated with the chosen cross validation: train the model on the train set of the first fold, then predict on its validation set; train the model on the train set of the second fold, then predict on its validation set, etc ... Then, concatenate the predictions on the validation sets to build the machine learning signal.
##### **Strategy backtesting**
##### Convert machine learning signal into a strategy
##### The transformed machine learning signal (long only, long short, binary, ternary, stock picking, proportional to probability or custom) is multiplied by the return between d+1 and d+2. As a reminder, the signal at date d predicts whether the return between d+1 and d+2 is positive or negative. Then, the PnL of date d could be associated with date d, d+1 or d+2. This is arbitrary and shouldn't impact the value of the PnL.
##### You invest the same amount of money every day. One exception: if you invest 1$ per day per stock, the amount invested every day may change depending on the strategy chosen. If you take into account the different values of capital invested every day in the calculation of the PnL, the step is still validated.
##### Metrics and plot
###### Is the PnL computed as: strategy \* future_return ?
###### Does the strategy give the amount invested at time t on asset i ?
###### Does the plot `strategy.png` contains an x axis: date ?
###### Does the plot `strategy.png` contains a y axis1: PnL of the strategy at time t ?
###### Does the plot `strategy.png` contains a y axis2: PnL of the SP500 at time t ?
###### Does the plot `strategy.png` use the same scale for y axis1 and y axis2 ?
###### Does the plot `strategy.png` contains a vertical line that shows the separation between train set and test set ?
##### Report
###### Does the report detail the features used ?
###### Does the report detail the pipeline used (imputer, scaler, dimension reduction and model) ?
###### Does the report detail the cross-validation used (length of train sets and validation sets and if possible the cross-validation plot) ?
###### Does the report detail the strategy chosen (description, PnL plot and the strategy metrics on the train set and test set) ?

BIN
subjects/ai/sp500-strategies/blocking_time_series_split.png


BIN
subjects/ai/sp500-strategies/metric_plot.png


172
subjects/ai/time-series-with-pandas/README.md

@ -0,0 +1,172 @@
# Time Series with Pandas
Time series data are data that are indexed by a sequence of dates or times. Today, you'll learn how to use methods built into Pandas to work with this index. You'll also learn for instance:
- to resample time series to change the frequency
- to calculate rolling and cumulative values for times series
- to build a backtest
Time series are used A LOT in finance. You'll learn to evaluate financial strategies using Pandas. It is important to keep in mind that Pandas operations are vectorized. That's why some questions constrain you not to use a for loop ;-).
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Series
- Exercise 2: Financial data
- Exercise 3: Multi asset returns
- Exercise 4: Backtest
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
_Version of Pandas I used to do the exercises: 1.0.1_.
I suggest using the most recent one.
### Resources
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`.
---
---
# Exercise 1: Series
The goal of this exercise is to learn to manipulate time series in Pandas.
1. Create a `Series` named `integer_series` from 1st January 2010 to 31st December 2020. Each date is associated with the number of days since 1st January 2010, starting at 0.
2. Using Pandas, compute a 7-day moving average **without a for loop**. This transformation smooths the time series by removing small fluctuations.
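A minimal sketch of both questions:
```python
import pandas as pd

# Question 1: one value per day, equal to the number of days since 2010-01-01.
dates = pd.date_range('2010-01-01', '2020-12-31', freq='D')
integer_series = pd.Series(range(len(dates)), index=dates)

# Question 2: 7-day moving average, no for loop.
moving_average = integer_series.rolling(7).mean()
```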
---
---
# Exercise 2: Financial data
The goal of this exercise is to learn to use Pandas on time series and on financial data.
The data we will use is Apple stock.
1. Using `Plotly` plot a Candlestick
2. Aggregate the data to the **last business day of each month**. The aggregation should consider the meaning of the variables. How many months are in the considered period ?
3. When comparing several stocks, the metric frequently used is the return rather than the price, because prices evolve in different ranges. The return at time t is defined as
- (Price(t) - Price(t-1))/ Price(t-1)
Using the open price, compute the **daily return**. Propose two different ways, **without a for loop**.
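A sketch of two vectorised ways on the open price, assuming the Apple data is loaded in `df` with an `Open` column (the column name is an assumption about the file):
```python
# Way 1: built-in percentage change.
daily_return_1 = df['Open'].pct_change()

# Way 2: explicit shift, (Price(t) - Price(t-1)) / Price(t-1).
daily_return_2 = (df['Open'] - df['Open'].shift(1)) / df['Open'].shift(1)
```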
---
---
# Exercise 3: Multi asset returns
The goal of this exercise is to learn to compute daily returns on a DataFrame that contains many assets (multi-assets).
```python
import numpy as np
import pandas as pd

business_dates = pd.bdate_range('2021-01-01', '2021-12-31')

# generate tickers
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']

# create the MultiIndex (Date, Ticker)
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])

# create the DataFrame
market_data = pd.DataFrame(index=index,
                           data=np.random.randn(len(index), 1),
                           columns=['Price'])
```
1. **Without using a for loop**, compute the daily returns (return(d) = (price(d)-price(d-1))/price(d-1)) for all the companies and return a DataFrame as:
| Date | ('Price', 'AAPL') | ('Price', 'AMZN') | ('Price', 'DAI') | ('Price', 'FB') | ('Price', 'GE') |
| :------------------ | ----------------: | ----------------: | ---------------: | --------------: | --------------: |
| 2021-01-01 00:00:00 | nan | nan | nan | nan | nan |
| 2021-01-04 00:00:00 | 1.01793 | 0.0512955 | 3.84709 | -0.503488 | 0.33529 |
| 2021-01-05 00:00:00 | -0.222884 | -1.64623 | -0.71817 | -5.5036 | -4.15882 |
Note: The data is generated randomly, so you may have different values. This only shows the expected DataFrame structure.
_Hint_: use `groupby`.
---
---
# Exercise 4: Backtest
The goal of this exercise is to learn to perform a backtest in Pandas. A backtest is a tool that allows you to know how a strategy would have performed retrospectively using historical data. In this exercise we will focus on the backtesting tool and not on how to build the best strategy.
We will backtest a **long only** strategy on Apple Inc. Long only means that we only consider buying the stock. The input signal at date d says if the close price will increase at d+1. We assume that the input signal is available before the market closes.
1. Drop the rows with missing values and compute the daily future return on the Apple stock (`AAPL.csv`) using the adjusted close price. The daily future return means: **Return(t) = (Price(t+1) - Price(t))/Price(t)**.
There are events such as splits or dividends that artificially change the price of the stock. That is why the close price is adjusted, to avoid having outliers in the price data.
2. Create a Series that contains a random array of 0s and 1s with **p=0.5**
```console
Here is an example of the expected time series
2010-01-01 1
2010-01-02 0
2010-01-03 0
2010-01-04 1
2010-01-05 0
Freq: D, Name: long_only_signal, dtype: int64
```
- The information in this series should be interpreted this way:
- On 2010-01-01 I receive `1` before the market closes, meaning that, if I trust the signal, the close price of day d+1 will increase. I should buy the stock before the market closes.
- On 2010-01-02 I receive `0` before the market closes, meaning that, if I trust the signal, the close price of day d+1 will not increase. I should not buy the stock.
3. Backtest the signal created in Question 2. Here are some assumptions made to backtest this signal:
- When, at date d, the signal equals 1 we buy 1$ of stock just before the market closes and we sell the stock just before the market closes the next day.
- When, at date d, the signal equals 0, we do not buy anything.
- The profit is not reinvested: when invested, the amount is always 1$.
- Fees are not considered
**The expected output** is a **Series that gives for each day the return of the strategy. The return of the strategy is the PnL (Profit and Losses) divided by the invested amount**. The PnL for day d is:
`(money earned this day - money invested this day)`
Let's take the example of a 20% return for an invested amount of 1$. The PnL is `(1.2 - 1) = 0.2`. We notice that the PnL when the signal is 1 equals the daily return. The PnL when the signal is 0 is 0.
By convention, we consider that the PnL of d is affected to day d and not d+1, even if the underlying return contains the information of d+1.
**The use of a for loop is not allowed**.
4. Compute the return of the strategy. The return of the strategy is defined as: `(Total earned - Total invested) / Total invested`
5. Now the input signal is: **always buy**. Compute the daily PnL and the total PnL. Plot the daily PnL of Q5 and of Q3 on the same plot
- https://www.investopedia.com/terms/b/backtesting.asp

185
subjects/ai/time-series-with-pandas/audit/README.md

@ -0,0 +1,185 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy` and `import pandas` run without any error?
---
---
#### Exercise 1: Series
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the output is as below. The best solution uses `pd.date_range` to generate the index and `range` to generate the integer series.
```console
2010-01-01 0
2010-01-02 1
2010-01-03 2
2010-01-04 3
2010-01-05 4
...
2020-12-27 4013
2020-12-28 4014
2020-12-29 4015
2020-12-30 4016
2020-12-31 4017
Freq: D, Name: integer_series, Length: 4018, dtype: int64
```
##### The question 2 is validated if the output is as below. If the `NaN` values have been dropped, the solution is also accepted. The solution uses `rolling().mean()`.
```console
2010-01-01 NaN
2010-01-02 NaN
2010-01-03 NaN
2010-01-04 NaN
2010-01-05 NaN
...
2020-12-27 4010.0
2020-12-28 4011.0
2020-12-29 4012.0
2020-12-30 4013.0
2020-12-31 4014.0
Freq: D, Name: integer_series, Length: 4018, dtype: float64
```
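A minimal sketch consistent with the outputs above (variable names are illustrative):
```python
import pandas as pd

# question 1: one integer per day since 2010-01-01
index = pd.date_range('2010-01-01', '2020-12-31', freq='D')
integer_series = pd.Series(range(len(index)), index=index, name='integer_series')

# question 2: 7-day moving average, without a for loop
moving_average = integer_series.rolling(7).mean()
```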
---
---
#### Exercise 2: Financial data
##### The exercise is validated if all questions of the exercise are validated.
###### Have the missing values and data types been checked ?
###### Have the string dates been converted to datetime type ?
###### Have the dates been set as index ?
###### Have `info` or/and `describe` been used to have a first look at the data ?
##### The question 1 is validated if the right columns are inserted in `Candlestick` `Plotly` object. The Candlestick is based on Open, High, Low and Close columns. The index is Date (datetime).
##### The question 2 is validated if the output of `print(transformed_df.head().to_markdown())` is as below and if there are **482 months**.
| Date | Open | Close | Volume | High | Low |
| :------------------ | -------: | -------: | ----------: | -------: | -------: |
| 1980-12-31 00:00:00 | 0.136075 | 0.135903 | 1.34485e+09 | 0.161272 | 0.112723 |
| 1981-01-30 00:00:00 | 0.141768 | 0.141316 | 6.08989e+08 | 0.155134 | 0.126116 |
| 1981-02-27 00:00:00 | 0.118215 | 0.117892 | 3.21619e+08 | 0.128906 | 0.106027 |
| 1981-03-31 00:00:00 | 0.111328 | 0.110871 | 7.00717e+08 | 0.120536 | 0.09654 |
| 1981-04-30 00:00:00 | 0.121811 | 0.121545 | 5.36928e+08 | 0.131138 | 0.108259 |
To get this result there are two ways: `resample` and `groupby`. There are two key steps:
- Find how to perform the aggregation on the last **business** day of each month. This is already implemented in Pandas, and the keyword that should be used, either as the `resample` rule or in `Grouper`, is `BM`.
- Choose the right aggregation function for each variable. The prices (Open, Close and Adjusted Close) should be aggregated by taking the `mean`. Low should be aggregated by taking the `minimum` because it represents the lowest price of the day, so the lowest price of the month is the minimum of the daily lows. The same logic applied to High leads to using the `maximum`. Volume should be aggregated using the `sum` because the monthly volume is equal to the sum of the daily volumes over the month.
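For instance, a minimal sketch with `resample` (the column names assume the usual layout of the Apple data set):
```python
# df is the Apple DataFrame indexed by the datetime Date column
transformed_df = df.resample('BM').agg({'Open': 'mean',
                                        'Close': 'mean',
                                        'Adj Close': 'mean',
                                        'High': 'max',
                                        'Low': 'min',
                                        'Volume': 'sum'})
print(len(transformed_df))  # number of months in the period
```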
##### The question 3 is validated if it doesn't involve a for loop and the output is as below. The first way to compute the return without a for loop is to use `pct_change`. The second way is to implement the formula given in the exercise in a vectorized way; to get the value at `t-1`, the data has to be shifted with `shift`.
```console
Date
1980-12-12 NaN
1980-12-15 -0.047823
1980-12-16 -0.073063
1980-12-17 0.019703
1980-12-18 0.028992
...
2021-01-25 0.049824
2021-01-26 0.003704
2021-01-27 -0.001184
2021-01-28 -0.027261
2021-01-29 -0.026448
Name: Open, Length: 10118, dtype: float64
```
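A minimal sketch of the two vectorized approaches (assuming `df` is the Apple DataFrame indexed by date):
```python
# first way: pct_change
daily_return = df['Open'].pct_change()

# second way: the formula of the exercise, vectorized with shift
daily_return_bis = (df['Open'] - df['Open'].shift(1)) / df['Open'].shift(1)
```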
---
---
#### Exercise 3: Multi asset returns
##### This question is validated if, without having used a for loop, the outputted DataFrame's shape is `(261, 5)` and the output is the same as the one returned by this line of code. The DataFrame contains random data: make sure your output and the one returned by this code are based on the same DataFrame.
```python
market_data.loc[market_data.index.get_level_values('Ticker')=='AAPL'].sort_index().pct_change()
```
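A minimal sketch of a `groupby`-based solution that should return an equivalent DataFrame:
```python
multi_asset_returns = (market_data
                       .sort_index()
                       .groupby(level='Ticker')
                       .pct_change()
                       .unstack('Ticker'))
print(multi_asset_returns.shape)  # (261, 5)
```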
---
---
#### Exercise 4: Backtest
##### The exercise is validated if all questions of the exercise are validated.
###### Have the missing values and data types been checked?
###### Have the string dates been converted to datetime type?
###### Have the dates been set as index?
###### Have `info` or/and `describe` been used to have a first look at the data?
**My results can be reproduced using `np.random.seed(2712)`. Given the versions of NumPy used, I do not guarantee the reproducibility of the results - that is why I also explain the steps to get to the solution.**
##### The question 1 is validated if the return is computed as: Return(t) = (Price(t+1) - Price(t))/Price(t) and returns this output. Note that if the index is not sorted in ascending order, the future return computed is wrong. The answer is also accepted if the return is computed as in exercise 2 and then shifted into the future using `shift`, but I do not recommend this implementation as it adds missing values!
```console
Date
1980-12-12 -0.052170
1980-12-15 -0.073403
1980-12-16 0.024750
1980-12-17 0.029000
1980-12-18 0.061024
...
2021-01-25 0.001679
2021-01-26 -0.007684
2021-01-27 -0.034985
2021-01-28 -0.037421
2021-01-29 NaN
Name: Daily_futur_returns, Length: 10118, dtype: float64
```
An example of a solution is:
```python
def compute_futur_return(price):
    return (price.shift(-1) - price) / price

compute_futur_return(df['Adj Close'])
```
##### The question 2 is validated if the index of the Series is the same as the index of the DataFrame. The data of the Series can be generated using `np.random.randint(0, 2, len(df.index))`.
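A minimal sketch (assuming `df` is the DataFrame loaded from `AAPL.csv` and indexed by date):
```python
import numpy as np
import pandas as pd

np.random.seed(2712)  # seed only for reproducibility of the illustration
signal = pd.Series(np.random.randint(0, 2, len(df.index)),
                   index=df.index,
                   name='long_only_signal')
```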
##### The question 3 is validated if the PnL is computed as `signal * futur_return`. Both series should have the same index.
```console
Date
1980-12-12 -0.052170
1980-12-15 -0.073403
1980-12-16 0.024750
1980-12-17 0.029000
1980-12-18 0.061024
...
2021-01-25 0.001679
2021-01-26 -0.007684
2021-01-27 -0.034985
2021-01-28 -0.037421
2021-01-29 NaN
Name: PnL, Length: 10119, dtype: float64
```
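A minimal sketch (assuming `signal` from question 2 and the future returns from question 1, here named `futur_return`):
```python
# the daily future return is earned only on the days where the signal is 1
pnl = (signal * futur_return).rename('PnL')
```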
##### The question 4 is validated if the return of the strategy is computed as: `(Total earned - Total invested) / Total invested`. The result should be close to 0. The formula given can be simplified as `PnLs.sum() / signal.sum()`. My return is 0.00043546984088551553 because I invested 5147$ and earned 5149$.
##### The question 5 is validated if the previous signal Series is replaced with 1s. Similarly to the previous question, we earned 10128$ and invested 10118$, which leads to a return of 0.00112670194140969 (0.1%).

10120
subjects/ai/time-series-with-pandas/data/AAPL.csv

File diff suppressed because it is too large

300
subjects/ai/train-and-evalute-machine-learning-models/README.md

@ -0,0 +1,300 @@
# Train and evaluate Machine Learning models
Today we will learn how to train and evaluate a machine learning model. You'll learn how to choose the right Machine Learning metric depending on the problem you are solving, and how to compute it. A metric gives an idea of how well the model performs. Depending on whether you are working on a classification problem or a regression problem, the metrics considered are different. It is important to understand that all metrics are just metrics, not the truth.
We will focus on the most important metrics:
- Regression:
- **R2**, **Mean Square Error**, **Mean Absolute Error**
- Classification:
- **F1 score**, **accuracy**, **precision**, **recall** and **AUC scores**. Even if it is not considered a metric, the **confusion matrix** is always useful to understand the model performance.
Warning: **Imbalanced data set**
Let us assume we are predicting a rare event that occurs less than 2% of the time. Having a model that scores a good accuracy is easy: it doesn't have to be "smart", all it has to do is always predict the majority class. Depending on the problem, this can be disastrous. For example, working with real life data, breast cancer prediction is an imbalanced problem where predicting the majority class leads to disastrous consequences. That is why metrics such as AUC are useful. Before computing the metrics, read this article carefully to understand the role of these metrics.
You'll learn to train other types of Machine Learning models than linear regression and logistic regression. You're not supposed to spend time understanding the theory; I recommend doing that during the projects. Today, read the Scikit-learn documentation to have a basic understanding of the models you use. Focus on how to use those Machine Learning models correctly with Scikit-learn.
You'll also learn what a grid search is and how to use it to train your machine learning models.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: MSE Scikit-learn
- Exercise 2: Accuracy Scikit-learn
- Exercise 3: Regression
- Exercise 4: Classification
- Exercise 5: Machine Learning models
- Exercise 6: Grid Search
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
_Version of Scikit Learn I used to do the exercises: 0.22_. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years.
### Resources
### Metrics
- https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-models-2.html
### Imbalance datasets
- https://stats.stackexchange.com/questions/260164/auc-and-class-imbalance-in-training-test-dataset
### Gridsearch
- https://medium.com/fintechexplained/what-is-grid-search-c01fe886ef0a
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **last stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
---
---
# Exercise 1: MSE Scikit-learn
The goal of this exercise is to learn to use `sklearn.metrics` to compute the mean squared error (MSE).
1. Compute the MSE using `sklearn.metrics` on `y_true` and `y_pred` below:
```python
y_true = [91, 51, 2.5, 2, -5]
y_pred = [90, 48, 2, 2, -4]
```
---
---
# Exercise 2: Accuracy Scikit-learn
The goal of this exercise is to learn to use `sklearn.metrics` to compute the accuracy.
1. Compute the accuracy using `sklearn.metrics` on `y_true` and `y_pred` below:
```python
y_pred = [0, 1, 0, 1, 0, 1, 0]
y_true = [0, 0, 1, 1, 1, 1, 0]
```
---
---
# Exercise 3: Regression
The goal of this exercise is to learn to evaluate a machine learning model using many regression metrics.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. _The goal is to focus on the metrics; that is why the code to fit the Linear Regression is given._
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']

# split data train test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=13)

# pipeline
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)

# fit
pipe.fit(X_train, y_train)
```
1. Predict on the train set and test set
2. Compute R2, Mean Square Error, Mean Absolute Error on both train and test set
---
---
# Exercise 4: Classification
The goal of this exercise is to learn to evaluate a machine learning model using many classification metrics.
Preliminary:
- Import the Breast Cancer data set and split it into a train set and a test set (20%). Fit a logistic regression on the data set. _The goal is to focus on the metrics; that is why the code to fit the Logistic Regression is given._
```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)
```
1. Predict on the train set and test set
2. Compute F1, accuracy, precision, recall, roc_auc scores on the train set and test set. Print the confusion matrix on the test set results.
**Note: AUC can only be computed on probabilities, not on classes.**
3. Plot the ROC AUC curve on the test set using `roc_curve` from scikit-learn. There are many ways to create this plot. It should look like this:
![alt text][logo_ex4]
[logo_ex4]: ./w2_day4_ex4_q3.png "ROC AUC "
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html
---
---
# Exercise 5: Machine Learning models
The goal of this exercise is to have an overview of the existing Machine Learning models and to learn to call them from scikit learn.
We will focus on:
- SVM/SVC
- Decision Tree
- Random Forest (Ensemble learning)
- Gradient Boosting (Ensemble learning, Boosting techniques)
All these algorithms exist in two versions: regression and classification. Even if the logic is similar in both classification and regression, the loss function is specific to each case.
It is really easy to get lost among all the existing algorithms. This article is very useful to have a clear overview of the models and to understand which algorithm to use and when. https://towardsdatascience.com/how-to-choose-the-right-machine-learning-algorithm-for-your-application-1e36c32400b9
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. _The goal is to focus on the metrics; that is why the code to fit the Linear Regression is given._
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']

# split data train test
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=43)

# pipeline
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)

# fit
pipe.fit(X_train, y_train)
```
1. Create 5 pipelines with 5 different models as final estimator (keep the imputer and scaler unchanged):
1. Linear Regression
2. SVM
3. Decision Tree (set `random_state=43`)
4. Random Forest (set `random_state=43`)
5. Gradient Boosting (set `random_state=43`)
Take time to get a basic understanding of the role of the main hyperparameters and their default values.
- For each algorithm, print the R2, MSE and MAE on both train set and test set.
---
---
# Exercise 6: Grid Search
The goal of this exercise is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameters, which are the parameters of the model, impact the performance of the model.
The scikit-learn object that runs the Grid Search is called `GridSearchCV`. We will learn about cross validation tomorrow. For now, let us set the parameter **cv** to `[(np.arange(18576), np.arange(18576,20640))]`.
This means that GridSearchCV splits the data set into a train set and a test set.
Preliminary:
- Load the California Housing data set. As mentioned, this time there's no need to split the data set into a train set and a test set since GridSearchCV does it.
You will have to run a Grid Search on the Random Forest with at least the hyperparameters mentioned below. It doesn't mean these are the only hyperparameters of the model. If possible, try at least 3 different values for each hyperparameter.
1. Run a Grid Search with `n_jobs` set to `-1` to parallelize the computations on all CPUs. The hyperparameters to change are: `n_estimators`, `max_depth`, `min_samples_leaf`. It may take some time to run.
Now, let us analyse the grid search's results in order to select the best model.
2. Write a function that takes as input the Grid Search object and that returns the best model **fitted**, the best set of hyperparameters and the associated score:
```python
def select_model_verbose(gs):
    return trained_model, best_params, best_score
```
3. Use the trained model to predict on a new point:
```python
new_point = np.array([[3.2031, 52., 5.47761194, 1.07960199, 910., 2.26368159, 37.85, -122.26]])
```
How do we know the best model returned by GridSearchCV is good enough and stable ? That is what we will learn tomorrow !
**WARNING: Some combinations of hyperparameters are not possible. For example, with the SVM, the linear kernel has no parameter `gamma`.**
**Note**:
- GridSearchCV can also take a Pipeline instead of a Machine Learning model. It is useful to combine some Imputers or Dimension reduction techniques with some Machine Learning models in the same Pipeline.
- It may be useful to check on Kaggle if some Kagglers share their Grid Searches.
Resources:
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- https://stackoverflow.com/questions/38555650/try-multiple-estimator-in-one-grid-search
- https://medium.com/fintechexplained/what-is-grid-search-c01fe886ef0a
- https://elutins.medium.com/grid-searching-in-machine-learning-quick-explanation-and-python-implementation-550552200596
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

249
subjects/ai/train-and-evalute-machine-learning-models/audit/README.md

@ -0,0 +1,249 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?
---
---
#### Exercise 1: MSE Scikit-learn
The goal of this exercise is to learn to use `sklearn.metrics` to compute the mean squared error (MSE).
1. Compute the MSE using `sklearn.metrics` on `y_true` and `y_pred` below:
```python
y_true = [91, 51, 2.5, 2, -5]
y_pred = [90, 48, 2, 2, -4]
```
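For reference, a minimal sketch of the expected computation:
```python
from sklearn.metrics import mean_squared_error

y_true = [91, 51, 2.5, 2, -5]
y_pred = [90, 48, 2, 2, -4]
print(mean_squared_error(y_true, y_pred))  # 2.25
```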
---
---
#### Exercise 2: Accuracy Scikit-learn
The goal of this exercise is to learn to use `sklearn.metrics` to compute the accuracy.
1. Compute the accuracy using `sklearn.metrics` on `y_true` and `y_pred` below:
```python
y_pred = [0, 1, 0, 1, 0, 1, 0]
y_true = [0, 0, 1, 1, 1, 1, 0]
```
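For reference, a minimal sketch of the expected computation:
```python
from sklearn.metrics import accuracy_score

y_pred = [0, 1, 0, 1, 0, 1, 0]
y_true = [0, 0, 1, 1, 1, 1, 0]
print(accuracy_score(y_true, y_pred))  # 4 correct predictions out of 7
```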
---
---
#### Exercise 3: Regression
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the predictions on the train set and test set are:
```console
#10 first values Train
array([1.54505951, 2.21338527, 2.2636205 , 3.3258957 , 1.51710076,
1.63209319, 2.9265211 , 0.78080924, 1.21968217, 0.72656239])
```
```console
#10 first values Test
array([ 1.82212706, 1.98357668, 0.80547979, -0.19259114, 1.76072418,
3.27855815, 2.12056804, 1.96099917, 2.38239663, 1.21005304])
```
##### The question 2 is validated if the results match this output:
```console
r2 on the train set: 0.3552292936915783
MAE on the train set: 0.5300159371615256
MSE on the train set: 0.5210784446797679
r2 on the test set: 0.30265471284464673
MAE on the test set: 0.5454023699809112
MSE on the test set: 0.5537420654727396
```
This result shows that the model performs slightly better on the train set than on the test set. That's frequent, since it is easier to get a good grade on an exam we studied for than on one that differs from what we prepared. However, the results are not good: r2 ~ 0.3. Fitting non-linear models such as the Random Forest on this data may improve the results. That's the goal of exercise 5.
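A minimal sketch of how these metrics can be computed (assuming the pipeline `pipe` and the split from the preliminary code of the exercise):
```python
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# pipe, X_train, X_test, y_train, y_test come from the preliminary code
y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)

for name, y, y_hat in [('train', y_train, y_train_pred),
                       ('test', y_test, y_test_pred)]:
    print(f'r2 on the {name} set:', r2_score(y, y_hat))
    print(f'MAE on the {name} set:', mean_absolute_error(y, y_hat))
    print(f'MSE on the {name} set:', mean_squared_error(y, y_hat))
```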
---
---
#### Exercise 4: Classification
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the predictions on the train set and test set are:
```console
# 10 first values Train
array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0])
# 10 first values Test
array([1, 1, 0, 0, 0, 1, 1, 1, 0, 0])
```
##### The question 2 is validated if the results match this output:
```console
F1 on the train set: 0.9911504424778761
Accuracy on the train set: 0.989010989010989
Recall on the train set: 0.9929078014184397
Precision on the train set: 0.9893992932862191
ROC_AUC on the train set: 0.9990161111794368
F1 on the test set: 0.9801324503311258
Accuracy on the test set: 0.9736842105263158
Recall on the test set: 0.9866666666666667
Precision on the test set: 0.9736842105263158
ROC_AUC on the test set: 0.9863247863247864
```
##### The question 2 is also validated if the confusion matrix on the test set matches:
```console
array([[37,  2],
       [ 1, 74]])
```
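A minimal sketch of how these scores can be computed (assuming `classifier`, `scaler` and the split from the preliminary code; the AUC is computed on predicted probabilities):
```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# classifier, scaler and the split come from the preliminary code
X_test_scaled = scaler.transform(X_test)
y_test_pred = classifier.predict(X_test_scaled)
y_test_proba = classifier.predict_proba(X_test_scaled)[:, 1]

print('F1 on the test set:', f1_score(y_test, y_test_pred))
print('Accuracy on the test set:', accuracy_score(y_test, y_test_pred))
print('Recall on the test set:', recall_score(y_test, y_test_pred))
print('Precision on the test set:', precision_score(y_test, y_test_pred))
print('ROC_AUC on the test set:', roc_auc_score(y_test, y_test_proba))  # probabilities, not classes
print(confusion_matrix(y_test, y_test_pred))
```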
##### The question 3 is validated if the ROC AUC plot looks like the plot below:
![alt text][logo_ex4]
[logo_ex4]: ../w2_day4_ex4_q3.png "ROC AUC "
Having a 99% ROC AUC is not usual: the data set we used is easy to classify. On real data sets, always check whether there's any leakage when you get such a high ROC AUC score.
---
---
#### Exercise 5: Machine Learning models
##### The question is validated if the scores outputted are close to the scores below. Some of the algorithms use random steps (random sampling used by the `RandomForest`). I used `random_state = 43` for the Random Forest, the Decision Tree and the Gradient Boosting.
```console
# Linear regression
TRAIN
r2 on the train set: 0.34823544284172625
MAE on the train set: 0.533092001261455
MSE on the train set: 0.5273648371379568
TEST
r2 on the test set: 0.3551785428138914
MAE on the test set: 0.5196420310323713
MSE on the test set: 0.49761195027083804
# SVM
TRAIN
r2 on the train set: 0.6462366150965996
MAE on the train set: 0.38356451633259875
MSE on the train set: 0.33464478671339165
TEST
r2 on the test set: 0.6162644671183826
MAE on the test set: 0.3897680598426786
MSE on the test set: 0.3477101776543003
# Decision Tree
TRAIN
r2 on the train set: 0.9999999999999488
MAE on the train set: 1.3685733933909677e-08
MSE on the train set: 6.842866883530944e-14
TEST
r2 on the test set: 0.6263651902480918
MAE on the test set: 0.4383758696244002
MSE on the test set: 0.4727017198871596
# Random Forest
TRAIN
r2 on the train set: 0.9705418471542886
MAE on the train set: 0.11983836612191189
MSE on the train set: 0.034538356420577995
TEST
r2 on the test set: 0.7504673649554309
MAE on the test set: 0.31889891600404635
MSE on the test set: 0.24096164834441108
# Gradient Boosting
TRAIN
r2 on the train set: 0.7395782392433273
MAE on the train set: 0.35656543036682264
MSE on the train set: 0.26167490389525294
TEST
r2 on the test set: 0.7157456298013534
MAE on the test set: 0.36455447680396397
MSE on the test set: 0.27058170064218096
```
It is important to notice that the Decision Tree overfits very easily. It learns the training data easily but is not able to generalize to the test set. This algorithm is not used much on its own because of its tendency to overfit.
However, Random Forest and Gradient Boosting propose a solid approach to correct the overfitting (in that case the parameter `max_depth` is set to `None`, which is why the Random Forest overfits the data). These two algorithms are used intensively in Machine Learning projects.
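A minimal sketch of how the five pipelines can be built (it reuses the split from the preliminary code and, for brevity, prints only R2 on the test set):
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# X_train, X_test, y_train, y_test come from the preliminary code of the exercise
models = {'Linear Regression': LinearRegression(),
          'SVM': SVR(),
          'Decision Tree': DecisionTreeRegressor(random_state=43),
          'Random Forest': RandomForestRegressor(random_state=43),
          'Gradient Boosting': GradientBoostingRegressor(random_state=43)}

for name, model in models.items():
    pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                     ('scaler', StandardScaler()),
                     ('model', model)])
    pipe.fit(X_train, y_train)
    print(name, '- R2 on the test set:', pipe.score(X_test, y_test))
```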
---
---
#### Exercise 6: Grid Search
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the code that runs the `gridsearch` is (the parameters may change):
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

parameters = {'n_estimators': [10, 50, 75],
              'max_depth': [3, 5, 7],
              'min_samples_leaf': [10, 20, 30]}

rf = RandomForestRegressor()
gridsearch = GridSearchCV(rf,
                          parameters,
                          cv=[(np.arange(18576), np.arange(18576, 20640))],
                          n_jobs=-1)
gridsearch.fit(X, y)
```
##### The question 2 is validated if the function is:
```python
def select_model_verbose(gs):
    return gs.best_estimator_, gs.best_params_, gs.best_score_
```
In my case, the grid search parameters found are not that interesting. Even though I reduced the over-fitting of the Random Forest, the score on the test set is lower than the test score obtained by the Gradient Boosting in the previous exercise without any optimal parameter search.
##### The question 3 is validated if the code used is:
```python
model, best_params, best_score = select_model_verbose(gridsearch)
model.predict(new_point)
```

BIN
subjects/ai/train-and-evalute-machine-learning-models/w2_day4_ex4_q3.png

Binary image not shown (36 KiB)

271
subjects/ai/visualizations/README.md

@ -0,0 +1,271 @@
# Visualizations
While working on a dataset, it is important to check the distribution of the data. Obviously, for most humans it is difficult to visualize data in more than 3 dimensions.
"Viz" is important to understand the data and to show results. We'll discover three libraries to visualize data in Python. These are among the most used visualization libraries in Python:
- Pandas visualization module
- Matplotlib
- Plotly
The goal is to understand the basics of those libraries. You'll have time during the project to master one (or the three) of them.
You may wonder why using one library is not enough. The reason is simple: it depends on the usage.
For example, if you want to check the data quickly, you may want to use the Pandas viz module or Matplotlib.
If you want to build a custom, more elaborate plot, I suggest using Matplotlib or Plotly.
And if you want to create a very nice, interactive plot, I suggest using Plotly.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Pandas plot 1
- Exercise 2: Pandas plot 2
- Exercise 3: Matplotlib 1
- Exercise 4: Matplotlib 2
- Exercise 5: Matplotlib subplots
- Exercise 6: Plotly 1
- Exercise 7: Plotly Box plots
### Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Plotly
- Jupyter or JupyterLab
I suggest using the most recent versions of the packages.
### Resources
- https://matplotlib.org/3.3.3/tutorials/index.html
- https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596
- https://github.com/rougier/matplotlib-tutorial
- https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html
---
---
# Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **last stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `plotly`.
---
---
# Exercise 1: Pandas plot 1
The goal of this exercise is to learn to create plots with Pandas. Pandas' `.plot()` is a wrapper around `matplotlib.pyplot.plot()`.
Here is the data we will be using:
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['christopher', 'marion', 'maria', 'mia', 'clement', 'randy', 'remi'],
    'age': [70, 30, 22, 19, 45, 33, 20],
    'gender': ['M', 'F', 'F', 'F', 'M', 'M', 'M'],
    'state': ['california', 'dc', 'california', 'dc', 'california', 'new york', 'porto'],
    'num_children': [2, 0, 0, 3, 8, 1, 4],
    'num_pets': [5, 1, 0, 5, 2, 2, 3]
})
```
1. Reproduce this plot. This plot is called a bar plot.
![alt text][logo]
[logo]: ./w1day03_ex1_plot1.png "Bar plot ex1"
The plot has to contain:
- the title
- name on x-axis
- legend
---
---
# Exercise 2: Pandas plot 2
The goal of this exercise is to learn to create plots with Pandas. Pandas' `.plot()` is a wrapper around `matplotlib.pyplot.plot()`.
```python
import pandas as pd

df = pd.DataFrame({
    'name': ['christopher', 'marion', 'maria', 'mia', 'clement', 'randy', 'remi'],
    'age': [70, 30, 22, 19, 45, 33, 20],
    'gender': ['M', 'F', 'F', 'F', 'M', 'M', 'M'],
    'state': ['california', 'dc', 'california', 'dc', 'california', 'new york', 'porto'],
    'num_children': [4, 2, 1, 0, 3, 1, 0],
    'num_pets': [5, 1, 0, 2, 2, 2, 3]
})
```
1. Reproduce this plot. This plot is called a scatter plot. Do you observe a relationship between the age and the number of children ?
![alt text][logo_ex2]
[logo_ex2]: ./w1day03_ex2_plot1.png "Scatter plot ex2"
The plot has to contain:
- the title
- name on x-axis
- name on y-axis
---
---
# Exercise 3: Matplotlib 1
The goal of this exercise is to learn to use Matplotlib to plot data. As you know, Matplotlib is the underlying library used by Pandas. It provides more options to plot custom visualizations. However, most of the plots we will create with Matplotlib can be reproduced with Pandas' `.plot()`.
1. Reproduce this plot. We assume the data points have integer coordinates.
![alt text][logo_ex3]
[logo_ex3]: ./w1day03_ex3_plot1.png "Scatter plot ex3"
The plot has to contain:
- the title
- name on x-axis and y-axis
- x-axis and y-axis are limited to [1,8]
- **style**:
- red dashdot line with a width of 3
- blue circles with a size of 12
---
---
# Exercise 4: Matplotlib 2
The goal of this exercise is to learn to use Matplotlib to plot different lines in the same figure on different axes using `twinx`. This is very useful to compare variables with different ranges.
Here is the data:
```python
left_data = [5, 7, 11, 13, 17]
right_data = [0.1, 0.2, 0.4, 0.8, -1.6]
x_axis = [0.0, 1.0, 2.0, 3.0, 4.0]
```
1. Reproduce this plot
![alt text][logo_ex4]
[logo_ex4]: ./w1day03_ex4_plot1.png "Twin axis plot ex4"
The plot has to contain:
- the title
- name on left y-axis and right y-axis
- **style**:
- left data in black
- right data in red
---
---
# Exercise 5: Matplotlib subplots
The goal of this exercise is to learn to use Matplotlib to create subplots.
1. Reproduce this plot using a **for loop**:
![alt text][logo_ex5]
[logo_ex5]: ./w1day03_ex5_plot1.png "Subplots ex5"
The plot has to contain:
- 6 subplots: 2 rows, 3 columns
- Keep space between plots: `hspace=0.5` and `wspace=0.5`
- Each plot contains
- Text (2,3,i) centered at 0.5, 0.5. _Hint_: check the parameter `ha` of `text`
- a title: Title i
---
---
# Exercise 6: Plotly 1
Plotly has evolved a lot over the past years. It is important to **always check the documentation**.
Plotly comes with a high-level interface: Plotly Express. It helps build some complex plots easily. The lesson won't detail the complex examples. Plotly Express is quite interesting when working with Pandas DataFrames because some built-in functions leverage them directly.
The plot outputted by Plotly is interactive and can also be dynamic.
The goal of the exercise is to plot the price of a company. Its price is generated below.
```python
import numpy as np
import pandas as pd

returns = np.random.randn(50)
price = 100 + np.cumsum(returns)
dates = pd.date_range(start='2020-09-01', periods=50, freq='B')
df = pd.DataFrame(zip(dates, price),
                  columns=['Date', 'Company_A'])
```
1. Using **Plotly Express**, reproduce the plot in the image. As the data is generated randomly, I do not expect you to reproduce exactly the same line.
![alt text][logo_ex6]
[logo_ex6]: ./w1day03_ex6_plot1.png "Time series ex6"
The plot has to contain:
- title
- x-axis name
- y-axis name
2. Same question but now using `plotly.graph_objects`. You may need to use `init_notebook_mode` from `plotly.offline`.
https://plotly.com/python/time-series/
---
---
# Exercise 7: Plotly Box plots
The goal of this exercise is to learn to use Plotly to plot box plots. A box plot is a method for graphically depicting groups of numerical data through their quartiles and values such as the min and max. It allows you to compare some variables quickly.
Let us generate 3 random arrays from a normal distribution, and shift the mean of the second and third arrays by 1 and 2 respectively.
```python
y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1 # shift mean
y2 = np.random.randn(50) + 2
```
1. Plot in the same Figure 2 box plots as shown in the image. In this exercise the style is not important.
![alt text][logo_ex7]
[logo_ex7]: ./w1day03_ex7_plot1.png "Box plot ex7"
The plot has to contain:
- the title
- the legend
https://plotly.com/python/box-plots/

186
subjects/ai/visualizations/audit/README.md

@ -0,0 +1,186 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import plotly` run without any error ?
---
---
#### Exercise 1: Pandas plot 1
##### The exercise is validated if all questions of the exercise are validated.
##### The solution of question 1 is accepted if the plot reproduces the bar plot in the image and respects those criteria.
###### Does it have a title ?
###### Does it have a name on the x-axis ?
###### Does it have a legend ?
![alt text][logo_ex1]
[logo_ex1]: ../w1day03_ex1_plot1.png "Bar plot ex1"
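For reference, a minimal sketch of a solution (the title text is illustrative; `df` is the DataFrame given in the exercise):
```python
import matplotlib.pyplot as plt

# bar plot with the names on the x-axis and a legend for the two numeric columns
df.plot(x='name', y=['num_children', 'num_pets'], kind='bar',
        title='Number of children and pets per person')
plt.show()
```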
---
---
#### Exercise 2: Pandas plot 2
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria. It is important to observe that the older people are, the more children they have.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
![alt text][logo_ex2]
[logo_ex2]: ../w1day03_ex2_plot1.png "Scatter plot ex2"
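For reference, a minimal sketch of a solution (the title text is illustrative; `df` is the DataFrame given in the exercise):
```python
import matplotlib.pyplot as plt

# scatter plot of age against the number of children
df.plot.scatter(x='age', y='num_children', title='Age vs number of children')
plt.show()
```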
---
---
#### Exercise 3: Matplotlib 1
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
###### Are the x-axis and y-axis limited to [1,8] ?
###### Is the line a red dashdot line with a width of 3 ?
###### Are the circles blue circles with a size of 12 ?
![alt text][logo_ex3]
[logo_ex3]: ../w1day03_ex3_plot1.png "Scatter plot ex3"
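For reference, a minimal sketch of a solution (the data points are assumed; only the style matters here):
```python
import matplotlib.pyplot as plt

# assumed integer data points within [1, 8]
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 3, 2, 5, 4, 7, 6, 8]

# red dashdot line of width 3 with blue circles of size 12
plt.plot(x, y, linestyle='-.', color='red', linewidth=3,
         marker='o', markerfacecolor='blue', markeredgecolor='blue', markersize=12)
plt.title('Red dashdot line with blue circles')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(1, 8)
plt.ylim(1, 8)
plt.show()
```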
---
---
#### Exercise 4: Matplotlib 2
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
###### Is the left data black ?
###### Is the right data red ?
![alt text][logo_ex4]
[logo_ex4]: ../w1day03_ex4_plot1.png "Twin axis ex4"
https://matplotlib.org/gallery/api/two_scales.html
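For reference, a minimal sketch of a `twinx` solution (the title and label texts are illustrative):
```python
import matplotlib.pyplot as plt

left_data = [5, 7, 11, 13, 17]
right_data = [0.1, 0.2, 0.4, 0.8, -1.6]
x_axis = [0.0, 1.0, 2.0, 3.0, 4.0]

fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()  # second y-axis sharing the same x-axis

ax_left.plot(x_axis, left_data, color='black')
ax_left.set_ylabel('left data', color='black')

ax_right.plot(x_axis, right_data, color='red')
ax_right.set_ylabel('right data', color='red')

ax_left.set_title('Twin axis plot')
plt.show()
```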
---
---
#### Exercise 5: Matplotlib subplots
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it contain 6 subplots (2 rows, 3 columns)?
###### Does it have space between plots (`hspace=0.5` and `wspace=0.5`)?
###### Do all subplots contain a title: `Title i` ?
###### Do all subplots contain a text `(2,3,i)` centered at `(0.5, 0.5)`? _Hint_: check the parameter `ha` of `text`
###### Have all subplots been created in a for loop ?
![alt text][logo_ex5]
[logo_ex5]: ../w1day03_ex5_plot1.png "Subplots ex5"
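For reference, a minimal sketch of a solution using a for loop:
```python
import matplotlib.pyplot as plt

fig = plt.figure()
plt.subplots_adjust(hspace=0.5, wspace=0.5)

for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    ax.text(0.5, 0.5, f'(2,3,{i})', ha='center')  # centered text
    ax.set_title(f'Title {i}')
plt.show()
```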
---
---
#### Exercise 6: Plotly 1
##### The exercise is validated if all questions of the exercise are validated.
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
![alt text][logo_ex6]
[logo_ex6]: ../w1day03_ex6_plot1.png "Time series ex6"
##### The solution of question 2 is accepted if the plot in the image is reproduced using `plotly.graph_objects` and respects those criteria.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
![alt text][logo_ex6]
[logo_ex6]: ../w1day03_ex6_plot1.png "Time series ex6"
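For reference, a minimal sketch of a Plotly Express solution for question 1 (the title text is illustrative):
```python
import numpy as np
import pandas as pd
import plotly.express as px

# same generated data as in the exercise
returns = np.random.randn(50)
price = 100 + np.cumsum(returns)
dates = pd.date_range(start='2020-09-01', periods=50, freq='B')
df = pd.DataFrame(zip(dates, price), columns=['Date', 'Company_A'])

fig = px.line(df, x='Date', y='Company_A', title='Company A stock price')
fig.show()
```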
---
---
#### Exercise 7: Plotly Box plots
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria. The code below shows a possible solution.
###### Does it have a title ?
###### Does it have a legend ?
![alt text][logo_ex7]
[logo_ex7]: ../w1day03_ex7_plot1.png "Box plot ex7"
```python
import plotly.graph_objects as go
import numpy as np

y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1  # shift mean
y2 = np.random.randn(50) + 2

fig = go.Figure()
fig.add_trace(go.Box(y=y0, name='Sample A',
                     marker_color='indianred'))
fig.add_trace(go.Box(y=y1, name='Sample B',
                     marker_color='lightseagreen'))
fig.update_layout(title='Box plots')  # the criteria require a title
fig.show()
```

BIN
subjects/ai/visualizations/w1day03_ex1_plot1.png

Binary image not shown (9.5 KiB)

BIN
subjects/ai/visualizations/w1day03_ex2_plot1.png

Binary image not shown (11 KiB)

BIN
subjects/ai/visualizations/w1day03_ex3_plot1.png

Binary image not shown (27 KiB)

BIN
subjects/ai/visualizations/w1day03_ex4_plot1.png

Binary image not shown (18 KiB)

BIN
subjects/ai/visualizations/w1day03_ex5_plot1.png

Binary image not shown (13 KiB)

BIN
subjects/ai/visualizations/w1day03_ex6_plot1.png

Binary image not shown (43 KiB)

BIN
subjects/ai/visualizations/w1day03_ex7_plot1.png

Binary image not shown (13 KiB)
