
fix: exercice to exercise

pull/1/head
b.ghazlane 3 years ago
parent
commit
e6827de0d7
  1. one_md_per_day_format/piscine/Week1/day1.md (18 changed lines)
  2. one_md_per_day_format/piscine/Week1/day2.md (24 changed lines)
  3. one_md_per_day_format/piscine/Week1/day3.md (24 changed lines)
  4. one_md_per_day_format/piscine/Week1/day4.md (26 changed lines)
  5. one_md_per_day_format/piscine/Week1/day5.md (20 changed lines)
  6. one_md_per_day_format/piscine/Week2/day03.md (32 changed lines)
  7. one_md_per_day_format/piscine/Week2/day05.md (22 changed lines)
  8. one_md_per_day_format/piscine/Week2/day1.md (22 changed lines)
  9. one_md_per_day_format/piscine/Week2/day2.md (28 changed lines)
  10. one_md_per_day_format/piscine/Week2/day4.md (28 changed lines)
  11. one_md_per_day_format/piscine/Week2/template.md (10 changed lines)
  12. one_md_per_day_format/piscine/Week3/template.md (10 changed lines)
  13. one_md_per_day_format/piscine/Week3/w3day02.md (18 changed lines)
  14. one_md_per_day_format/piscine/Week3/w3day03.md (22 changed lines)
  15. one_md_per_day_format/piscine/Week3/w3day04.md (40 changed lines)
  16. one_md_per_day_format/piscine/Week3/w3day05.md (24 changed lines)
  17. one_md_per_day_format/piscine/Week3/w3day1.md (20 changed lines)

18
one_md_per_day_format/piscine/Week1/day1.md

@ -29,7 +29,7 @@ Save one notebook per day or one per exercise. Use markdown to divide your noteb
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/
# Exercice 1 Your first NumPy array
# Exercise 1 Your first NumPy array
The goal of this exercise is to use many Python data types in **NumPy** arrays. **NumPy** arrays are intensively used in **NumPy** and **Pandas**. They are flexible and allow to use optimized **NumPy** underlying functions.
@ -71,7 +71,7 @@ for i in your_np_array:
---
# Exercice 2 Zeros
# Exercise 2 Zeros
The goal of this exercise is to learn to create a NumPy array with 0s.
@ -86,7 +86,7 @@ The goal of this exercise is to learn to create a NumPy array with 0s.
---
# Exercice 3 Slicing
# Exercise 3 Slicing
The goal of this exercise is to learn NumPy indexing/slicing. It allows to access values of the NumPy array efficiently and without a for loop.
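For readers of this hunk, a minimal sketch of slicing and boolean masking without a for loop. The array contents and the name `every_second` are assumptions made for illustration; they are not taken from day1.md.

```python
import numpy as np

# Hypothetical data: the integers 1..10 (not the exercise's actual array).
integers = np.arange(1, 11)

# Slicing: every second element, starting at index 0.
every_second = integers[::2]

# Boolean mask: True where the value is even.
mask = integers % 2 == 0

# Assignment through the mask replaces the even values in one step.
integers[mask] = 0

print(every_second)
print(integers)
```

The mask has the same shape as the array, so the assignment touches only the positions where the mask is `True`.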
@ -113,7 +113,7 @@ integers[mask] = 0
---
# Exercice 4 Random
# Exercise 4 Random
The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons:
@ -174,7 +174,7 @@ For this exercise, as the results may change depending on the version of the pac
---
# Exercice 5: Split, concatenate, reshape arrays
# Exercise 5: Split, concatenate, reshape arrays
The goal of this exercise is to learn to concatenate and reshape arrays.
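A short sketch of the three operations named in the exercise title, assuming two small made-up arrays `a` and `b` (hypothetical names and values, not the ones used in day1.md):

```python
import numpy as np

# Two hypothetical 1-D arrays.
a = np.arange(1, 7)    # values 1 to 6
b = np.arange(7, 13)   # values 7 to 12

# Concatenate: join the arrays end to end into one array of 12 values.
joined = np.concatenate([a, b])

# Reshape: view the 12 values as 3 rows of 4 columns.
matrix = joined.reshape(3, 4)

# Split: cut the matrix before row index 1, giving 1 row and 2 rows.
top, bottom = np.split(matrix, [1], axis=0)

print(matrix)
print(top.shape, bottom.shape)
```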
@ -214,7 +214,7 @@ https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of Num
---
# Exercice 6: Broadcasting and Slicing
# Exercise 6: Broadcasting and Slicing
The goal of this exercise is to learn to access values of n-dimensional arrays efficiently.
@ -266,7 +266,7 @@ https://jakevdp.github.io/PythonDataScienceHandbook/ (section: Computation on Ar
---
# Exercice 7: NaN
# Exercise 7: NaN
The goal of this exercise is to learn to deal with missing data in NumPy and to manipulate NumPy arrays.
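One possible way to handle missing values in NumPy without a for loop is sketched here. The `grades` array is invented for illustration, and filling with the column mean is an assumption, not necessarily the rule the exercise asks for.

```python
import numpy as np

# Hypothetical data with missing entries encoded as NaN.
grades = np.array([[7.0, np.nan, 5.0],
                   [np.nan, 8.0, 6.0]])

# Boolean mask of the missing positions.
missing = np.isnan(grades)

# Column means computed while ignoring NaN.
column_means = np.nanmean(grades, axis=0)

# Replace each missing entry with the mean of its column;
# column_means is broadcast across the rows.
filled = np.where(missing, column_means, grades)

print(filled)
```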
@ -323,7 +323,7 @@ This question is validated if, without having used a for loop or having filled t
---
# Exercice 8: Wine
# Exercise 8: Wine
The goal of this exercise is to learn to perform a basic data analysis on real data using NumPy.
@ -404,7 +404,7 @@ This can be done in three steps: Get the max, create a boolean mask that indicat
---
## Exercice 9 Football tournament
## Exercise 9 Football tournament
The goal of this exercise is to learn to use permutations, complex

24
one_md_per_day_format/piscine/Week1/day2.md

@ -17,7 +17,7 @@ Not only is the Pandas library a central component of the data science toolkit b
Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Most of the topics we will cover today are explained and describes with examples in the first ressource. The number of exercices is low on purpose: Take the time to understand the chapter 5 of the ressource, even if there are 40 pages.
Most of the topics we will cover today are explained and describes with examples in the first ressource. The number of exercises is low on purpose: Take the time to understand the chapter 5 of the ressource, even if there are 40 pages.
The version of Pandas I used is '1.0.1'.
@ -41,9 +41,9 @@ https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
# Exercice 1
# Exercise 1
The goal of this exercice is to learn to create basic Pandas objects.
The goal of this exercise is to learn to create basic Pandas objects.
1. Create a DataFrame as below this using two ways:
- From a NumPy array
@ -82,9 +82,9 @@ and if the types of the first value of the columns are
```
# Exercice 2 **Electric power consumption**
# Exercise 2 **Electric power consumption**
The goal of this exercice is to learn to manipulate real data with Pandas.
The goal of this exercise is to learn to manipulate real data with Pandas.
The data set used is **Individual household electric power consumption**
1. Delete the columns `Time`, `Sub_metering_2` and `Sub_metering_3`
@ -118,7 +118,7 @@ The data set used is **Individual household electric power consumption**
## Correction:
1. `del` works but it is not a solution I recommand. For this exercice it is accepted. It is expected to use `drop` with `axis=1`. `inplace=True` may be useful to avoid to affect the result to a variable.
1. `del` works but it is not a solution I recommand. For this exercise it is accepted. It is expected to use `drop` with `axis=1`. `inplace=True` may be useful to avoid to affect the result to a variable.
2. The prefered solution is `set_index` with `inplace=True`. As long as the DataFrame returns the output below, the solution is accepted. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted.
@ -219,9 +219,9 @@ The data set used is **Individual household electric power consumption**
# Exercice 3: E-commerce purchases
# Exercise 3: E-commerce purchases
The goal of this exercice is to learn to manipulate real data with Pandas. This exercice is less guided since the exercice 2 should have given you a nice introduction.
The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided since the exercise 2 should have given you a nice introduction.
The data set used is **E-commerce purchases**.
@ -240,7 +240,7 @@ Questions:
12. What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)
## Correction
The validate this exercice all answers should return the expected numerical value given in the correction AND uses Pandas. For example using NumPy to compute the mean doesn't respect the philosophy of the exercice which is to use Pandas.
The validate this exercise all answers should return the expected numerical value given in the correction AND uses Pandas. For example using NumPy to compute the mean doesn't respect the philosophy of the exercise which is to use Pandas.
1. How many rows and columns are there?**10000 entries**
@ -303,9 +303,9 @@ The validate this exercice all answers should return the expected numerical valu
The prefered solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurences.
# Exercice 3 Handling missing values
# Exercise 3 Handling missing values
The goal of this exercice is to learn to handle missing values. In the previsous exercice we used the first techniques: filter out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small.
The goal of this exercise is to learn to handle missing values. In the previsous exercise we used the first techniques: filter out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small.
This article explains the different types of missing data and how they should be handled. https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
"
@ -327,7 +327,7 @@ This article explains the different types of missing data and how they should be
## Correction
To validate the exercice, you should have done these two steps in that order:
To validate the exercise, you should have done these two steps in that order:
- Convert the numerical columns to `float`
```

24
one_md_per_day_format/piscine/Week1/day3.md

@ -33,9 +33,9 @@ https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimat
# Exercice 1 Pandas plot 1
# Exercise 1 Pandas plot 1
The goal of this exercice is to learn to create plots with use Pandas. Panda's `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
The goal of this exercise is to learn to create plots with use Pandas. Panda's `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
Here is the data we will be using:
@ -69,9 +69,9 @@ The plot has to contain:
[logo]: images/day03/w1day03_ex1_plot1.png "Bar plot ex1"
## Exercice 2: Pandas plot 2
## Exercise 2: Pandas plot 2
The goal of this exercice is to learn to create plots with use Pandas. Panda's `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
The goal of this exercise is to learn to create plots with use Pandas. Panda's `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
```
@ -108,7 +108,7 @@ You should also observe that the older people are the bigger the number of chil
## Exercice 3 Matplotlib 1
## Exercise 3 Matplotlib 1
The goal of this plot is to learn to use Matplotlib to plot data. As you know, Matplotlib is the underlying library used by Pandas. It provides more options to plot custom visualizations. Howerver, most of the plots we will create with Matplotlib can be reproduced with Pandas' `.plot()`.
@ -145,7 +145,7 @@ The plot has to contain:
[logo_ex3]: images/day03/w1day03_ex3_plot1.png "Scatter plot ex3"
# Exercice 4 Matplotlib 2
# Exercise 4 Matplotlib 2
The goal of this plot is to learn to use Matplotlib to plot different lines in the same plot on different axis using `twinx`. This very useful to compare variables in different ranges.
Here is the data:
@ -187,7 +187,7 @@ The plot has to contain:
https://matplotlib.org/gallery/api/two_scales.html
# Exercice 5 Matplotlib subplots
# Exercise 5 Matplotlib subplots
The goal of this exerice is to learn to use Matplotlib to create subplots.
1. Reproduce this plot using a **for loop**:
@ -224,14 +224,14 @@ The plot has to contain:
Check that the plot has been created with a for loop.
# Exercice 6 Plotly 1
# Exercise 6 Plotly 1
Plotly has evolved a lot in the previous years. It is important to **always check the documentation**.
Plotly comes with a high level interface: Plotly Express. It helps building some complex plots easily. The lesson won't detail the complex examples. Plotly express is quite interesting while using Pandas Dataframes because there are some built-in functions that leverage Pandas Dataframes.
The plot outputed by Plotly is interactive and can also be dynamic.
The goal of the exercice is to plot the price of a company. Its price is generated below.
The goal of the exercise is to plot the price of a company. Its price is generated below.
```
returns = np.random.randn(50)
@ -284,9 +284,9 @@ The plot has to contain:
[logo_ex6]: images/day03/w1day03_ex6_plot1.png "Time series ex6"
# Exercice 7 Plotly Box plots
# Exercise 7 Plotly Box plots
The goal of this exercice is to learn to use Plotly to plot Box Plots. It is t is a method for graphically depicting groups of numerical data through their quartiles and values as min, max. It allows to compare quickly some variables.
The goal of this exercise is to learn to use Plotly to plot Box Plots. It is t is a method for graphically depicting groups of numerical data through their quartiles and values as min, max. It allows to compare quickly some variables.
Let us generate 3 random arrays from a normal distribution. And for each array add respectively 1, 2 to the normal distribution.
@ -295,7 +295,7 @@ y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1 # shift mean
y2 = np.random.randn(50) + 2
```
1. Plot in the same Figure 2 box plots as shown in the image. In this exercice the style is not important.
1. Plot in the same Figure 2 box plots as shown in the image. In this exercise the style is not important.
![alt text][logo_ex7]

26
one_md_per_day_format/piscine/Week1/day4.md

@ -25,9 +25,9 @@ https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-d
# Exercice 1 Concatenate
# Exercise 1 Concatenate
The goal of this exercice is to learn to concatenate DataFrames. The logic is the same for the Series.
The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for the Series.
Here are the two DataFrames to concatenate:
@ -55,9 +55,9 @@ df2 = pd.DataFrame([['c', 1], ['d', 2]],
| 3 | d | 2 |
# Exercice 2 Merge
# Exercise 2 Merge
The goal of this exercice is to learn to merge DataFrames
The goal of this exercise is to learn to merge DataFrames
The logic of merging DataFrames in Pandas is quite similar as the one used in SQL.
Here are the two DataFrames to merge:
@ -125,9 +125,9 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name.
## Exercice 3 Merge MultiIndex
## Exercise 3 Merge MultiIndex
The goal of this exercice is to learn to merge DataFrames with MultiIndex.
The goal of this exercise is to learn to merge DataFrames with MultiIndex.
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reasons the Data Engineer lost the last 15 days of alternative data.
1. Using `market_data` as the reference, merge `alternative_data` on `market_data`
@ -182,9 +182,9 @@ One of the answers that returns the correct DataFrame is:
2. This question is validated if the number of missing in the DataFrame is equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
# Exercice 4 Groupby Apply
# Exercise 4 Groupby Apply
The goal of this exercice is to learn to group the data and apply a function on the groups.
The goal of this exercise is to learn to group the data and apply a function on the groups.
The use case we will work on is computing
1. Create a function that uses `pandas.DataFrame.clip` and that replace extreme values by a given percentile. The values that are greater than the upper percentile 80% are replaced by the percentile 80%. The values that are smaller than the lower percentile 20% are replaced by the percentile 20%. This process that correct outliers is called **winsorizing**.
@ -251,7 +251,7 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
## Correction
The for loop is forbidden in this exercice. The goal is to use `groupby` and `apply`.
The for loop is forbidden in this exercise. The goal is to use `groupby` and `apply`.
1. This question is validated if the output is:
@ -315,9 +315,9 @@ https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pa
# Exercice 5 Groupby Agg
# Exercise 5 Groupby Agg
The goal of this exercice is to learn to compute different type of agregations on the groups. This small DataFrame contains products and prices.
The goal of this exercise is to learn to compute different type of agregations on the groups. This small DataFrame contains products and prices.
| | value | product |
|---:|--------:|:-------------|
@ -353,9 +353,9 @@ Note: The columns don't have to be MultiIndex
My answer is: `df.groupby('product').agg({'value':['min','max','mean']})`
# Exercice 6 Unstack
# Exercise 6 Unstack
The goal of this exercice is to learn to unstack a MultiIndex.
The goal of this exercise is to learn to unstack a MultiIndex.
Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest etc ...
```

20
one_md_per_day_format/piscine/Week1/day5.md

@ -31,9 +31,9 @@ https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
# Exercice 1
# Exercise 1
The goal of this exercice is to learn to manipulate time series in Pandas.
The goal of this exercise is to learn to manipulate time series in Pandas.
1. Create a `Series` named `integer_series`from 1st January 2010 to 31 December 2020. At each date is associated the number of days since 1st January 2010. It starts with 0.
@ -79,9 +79,9 @@ The goal of this exercice is to learn to manipulate time series in Pandas.
```
If the `NaN` values have been dropped the solution is also accepted. The solution uses `rolling().mean()`.
# Exercice 2
# Exercise 2
The goal of this exercice is to learn to use Pandas on Time Series an on Financial data.
The goal of this exercise is to learn to use Pandas on Time Series an on Financial data.
The data we will use is Apple stock.
@ -144,11 +144,11 @@ To get this result there are two ways: `resample` and `groupby`. There are two k
Name: Open, Length: 10118, dtype: float64
```
- The first way is to compute the return without for loop is to use `pct_change`
- The second way to compute the return without for loop is to implement the formula given in the exercice in a vectorized way. To get the value at `t-1` you can use `shift`
- The second way to compute the return without for loop is to implement the formula given in the exercise in a vectorized way. To get the value at `t-1` you can use `shift`
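The two approaches listed above can be compared on a toy series. This sketch assumes the return formula is `(P_t - P_{t-1}) / P_{t-1}`, which is not visible in this hunk, and uses made-up prices rather than the Apple data.

```python
import pandas as pd

# Hypothetical prices (not the data set used in the exercise).
prices = pd.Series(
    [100.0, 110.0, 99.0, 108.9],
    index=pd.date_range("2021-01-04", periods=4, freq="D"),
)

# First way: built-in percentage change.
returns_pct_change = prices.pct_change()

# Second way: vectorized formula, with shift(1) giving the value at t-1.
previous = prices.shift(1)
returns_shift = (prices - previous) / previous

print(returns_pct_change)
print(returns_shift)
```

Both series start with `NaN`, since the first date has no previous value.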
# Exercice 3 Multi asset returns
# Exercise 3 Multi asset returns
The goal of this exercice is to learn to compute daily returns on a DataFrame that contains many assets (multi-assets).
The goal of this exercise is to learn to compute daily returns on a DataFrame that contains many assets (multi-assets).
```
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
@ -187,9 +187,9 @@ Note: The data is generated randomly, the values you may have a different result
The DataFrame contains random data. Make sure your output and the one returned by this code is based on the same DataFrame.
# Exercice 4 Backtest
# Exercise 4 Backtest
The goal of this exercice is to learn to perform a backtest in Pandas. A backtest is a tool that allows you to know how a strategy would have performed retrospectively using historical data. In this exercice we will focus on the backtesting tool and not on how to build the best strategy.
The goal of this exercise is to learn to perform a backtest in Pandas. A backtest is a tool that allows you to know how a strategy would have performed retrospectively using historical data. In this exercise we will focus on the backtesting tool and not on how to build the best strategy.
We will backtest a **long only** strategy on Apple Inc. Long only means that we only consider buying the stock. The input signal at date d says if the close price will increase at d+1. We assume that the input signal is available before the market closes.
@ -266,7 +266,7 @@ My results can be reproduced using: `np.random.seed = 2712`. Given the versions
Name: Daily_futur_returns, Length: 10118, dtype: float64
```
The answer is also accepted if the returns is computed as in the exercice 2 and then shifted in the futur using `shift`, but I do not recommend this implementation as it adds missing values !
The answer is also accepted if the returns is computed as in the exercise 2 and then shifted in the futur using `shift`, but I do not recommend this implementation as it adds missing values !
An example of solution is:

32
one_md_per_day_format/piscine/Week2/day03.md

@ -35,9 +35,9 @@ This object takes as input the preprocessing transforms and a Machine Learning m
## Ressources
TODO
# Exercice 1 Imputer 1
# Exercise 1 Imputer 1
The goal of this exercice is to learn how to use an Imputer to fill missing values on basic example.
The goal of this exercise is to learn how to use an Imputer to fill missing values on basic example.
```
train_data = [[7, 6, 5],
@ -84,11 +84,11 @@ test_data = [[np.nan, 1, 2],
[ 4., 2., 4.]])
```
# Exercice 2 Scaler
# Exercise 2 Scaler
The goal of this exercice is to learn to scale a data set. There are various scaling techniques, we will focus on `StandardScaler` from scikit learn.
The goal of this exercise is to learn to scale a data set. There are various scaling techniques, we will focus on `StandardScaler` from scikit learn.
We will use a tiny data set for this exercice that we will generate by ourselves:
We will use a tiny data set for this exercise that we will generate by ourselves:
```
X_train = np.array([[ 1., -1., 2.],
@ -140,8 +140,8 @@ array([[ 1.22474487, -1.22474487, 0.53452248],
[ 0. , 1.22474487, 0.53452248]])
```
# Exercice 3 One hot Encoder
The goal of this exercice is to learn how to deal with Categorical variables using the OneHot Encoder.
# Exercise 3 One hot Encoder
The goal of this exercise is to learn how to deal with Categorical variables using the OneHot Encoder.
```
X_train = [['Python'], ['Java'], ['Java'], ['C++']]
@ -199,8 +199,8 @@ https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEn
# Exercice 4 Ordinal Encoder
The goal of this exercice is to learn how to deal with Categorical variables using the Ordinal Encoder.
# Exercise 4 Ordinal Encoder
The goal of this exercise is to learn how to deal with Categorical variables using the Ordinal Encoder.
In that case, we want the model to consider that: **good > neutral > bad**
@ -242,9 +242,9 @@ array([[2.],
# Exercice 5 Categorical variables
# Exercise 5 Categorical variables
The goal of this exercice is to learn how to deal with Categorical variables with Ordinal Encoder, Label Encoder and OneHot Encoder.
The goal of this exercise is to learn how to deal with Categorical variables with Ordinal Encoder, Label Encoder and OneHot Encoder.
Preliminary:
- Load the breast-cancer.csv file
@ -359,7 +359,7 @@ AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provid
```
**It means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the columns name in the right order. This step is not required in that exercice**
**It means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the columns name in the right order. This step is not required in that exercise**
@ -438,9 +438,9 @@ array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 2., 2., 0.,
```
# Exercice 6 Pipeline
# Exercise 6 Pipeline
The goal of this exercice is to learn to use the Scikit-learn object: Pipeline. The data set: used for this exercice is the `iris` data set.
The goal of this exercise is to learn to use the Scikit-learn object: Pipeline. The data set: used for this exercise is the `iris` data set.
Preliminary:
- Run the code below.
@ -513,9 +513,9 @@ On financial data set, the ratio signal to noise is low. Trying to forecast stoc
# Exercice 1 Imputer 2
# Exercise 1 Imputer 2
The goal of this exercice is to learn how to use an Imputer to fill missing values in the data set.
The goal of this exercise is to learn how to use an Imputer to fill missing values in the data set.
**Reminder**: The data exploration should be done first. It tells which rows/variables should be removed because there are too many missing values. Then the remaining data points can be treated using an Imputer.

22
one_md_per_day_format/piscine/Week2/day05.md

@ -6,12 +6,12 @@
# Introduction
If you finished yesterday's exercices you should be able to train several Machine Learning algorithms and to choose one returned by GridSearchCV.
If you finished yesterday's exercises you should be able to train several Machine Learning algorithms and to choose one returned by GridSearchCV.
GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the **cv** parameter to compute the GridSearch with a train set and a test set.
It means that the selected model is based on one single measure. What if, by luck, we predict correctly on that section ? What if the best model is bad ? What if I could have selected a better model ?
We will answer these questions today ! The topics we will cover are the one of the most important in Machine Learning.
Must read before to start the exercices:
Must read before to start the exercises:
- Biais-Variance trade off; aka Underfitting/Overfitting.
- https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
@ -28,9 +28,9 @@ Must read before to start the exercices:
## Ressources
# Exercice 1: K-Fold
# Exercise 1: K-Fold
The goal of this exercice is to learn to use `KFold` to split the data set in a k-fold cross validation. Most of the time you won't use this function to split your data because this function is used by others as `cross_val_score` or `cross_validate` or `GridSearchCV` ... . But, this allows to understand the splitting and to create a custom one if needed.
The goal of this exercise is to learn to use `KFold` to split the data set in a k-fold cross validation. Most of the time you won't use this function to split your data because this function is used by others as `cross_val_score` or `cross_validate` or `GridSearchCV` ... . But, this allows to understand the splitting and to create a custom one if needed.
```
X = np.array(np.arange(1,21).reshape(10,-1))
@ -81,9 +81,9 @@ y = np.array(np.arange(1,11))
# Exercice 2: Cross validation (k-fold)
# Exercise 2: Cross validation (k-fold)
The goal of this exercice is to learn how to use cross validation. After reading the articles you should be able to explain why we need to cross-validate the models. We will firstly focus on Linear Regression to reduce the computation time. We will be using `cross_validate` to run the cross validation. Note that `cross_val_score` is similar bu the `cross_validate` calculates one or more scores and timings for each CV split.
The goal of this exercise is to learn how to use cross validation. After reading the articles you should be able to explain why we need to cross-validate the models. We will firstly focus on Linear Regression to reduce the computation time. We will be using `cross_validate` to run the cross validation. Note that `cross_val_score` is similar bu the `cross_validate` calculates one or more scores and timings for each CV split.
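To make the difference between the two functions concrete, here is a sketch on synthetic data. The data set, the 5 folds and the chosen metrics are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, cross_validate

# Synthetic regression data, used only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# cross_val_score returns a single array: one score per fold.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# cross_validate returns a dict: fit/score timings plus one entry per metric.
results = cross_validate(
    model, X, y, cv=5,
    scoring=["r2", "neg_mean_squared_error"],
    return_train_score=True,
)

print(scores)
print(sorted(results.keys()))
```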
Preliminary:
@ -159,9 +159,9 @@ The model is consistent across folds: it is stable. That's a first sign that the
# Exercice 3 GridsearchCV
# Exercise 3 GridsearchCV
The goal of this exercice is to learn to use GridSearchCV to run a grid search, predict on the test set and score on the test set.
The goal of this exercise is to learn to use GridSearchCV to run a grid search, predict on the test set and score on the test set.
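The three steps named above (grid search, predict on the test set, score on the test set) could look like the following sketch. The `Ridge` estimator, the `alpha` grid and the synthetic data are assumptions; the exercise itself may use a different model and data set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data and a held-out test set (illustrative only).
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid over the regularization strength.
param_grid = {"alpha": [0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")

# 1. Run the grid search on the train set only.
search.fit(X_train, y_train)

# 2. Predict on the test set with the refitted best estimator.
y_pred = search.predict(X_test)

# 3. Score on the test set with the same metric as the search.
test_score = search.score(X_test, y_test)

print(search.best_params_, test_score)
```

Keeping the test set out of `fit` means the final score is measured on data the search never saw.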
Preliminary:
@ -250,13 +250,13 @@ WARNING: If the score used in classification is the AUC, there is one rare case
# Exercice 5 Validation curve and Learning curve
# Exercise 5 Validation curve and Learning curve
The goal of this exercice is to learn to analyse the models' performance with two tools:
The goal of this exercise is to learn to analyse the models' performance with two tools:
- Validation curve
- Learning curve
For this exercice we will use a dataset of 100k data points to give you an idea of the computation time you can expect during projects.
For this exercise we will use a dataset of 100k data points to give you an idea of the computation time you can expect during projects.
Preliminary:

22
one_md_per_day_format/piscine/Week2/day1.md

@ -51,9 +51,9 @@ https://scikit-learn.org/stable/tutorial/index.html
- https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture?hl=en
# Exercice 1 Scikit-learn estimator
# Exercise 1 Scikit-learn estimator
The goal of this exercice is to learn to fit a Scikit-learn estimator and use it to predict.
The goal of this exercise is to learn to fit a Scikit-learn estimator and use it to predict.
```
@ -92,9 +92,9 @@ X, y = [[1],[2.1],[3]], [[1],[2],[3]]
```
# Exercice 2 Linear regression in 1D
# Exercise 2 Linear regression in 1D
The goal of this exercice is to understand how the linear regression works in one dimension. To do so, we will generate a data in one dimension. Using `make regression` from Scikit-learn, generate a data set with 100 observations:
The goal of this exercise is to understand how the linear regression works in one dimension. To do so, we will generate a data in one dimension. Using `make regression` from Scikit-learn, generate a data set with 100 observations:
```
X, y, coef = make_regression(n_samples=100,
@ -162,9 +162,9 @@ array([ 83.86186727, 140.80961751, 116.3333897 , 64.52998689,
6. This question is validated if the MSE returned is `2854.2871542048706`
# Exercice 3: Train test split
# Exercise 3: Train test split
The goal of this exercice is to learn to split a data set. It is important to understand why we split the data in two sets. To put it in a nutshell: the Machine Learning algorithms learns on the training data and is evaluated on the that it hasn't seen before: the testing data.
The goal of this exercise is to learn to split a data set. It is important to understand why we split the data in two sets. To put it in a nutshell: the Machine Learning algorithms learns on the training data and is evaluated on the that it hasn't seen before: the testing data.
This video gives a basic and nice explanation: https://www.youtube.com/watch?v=_vdMKioCXqQ
@ -208,10 +208,10 @@ y_test:
[ 9 10]
```
# Exercice 4 Forecast diabetes progression
# Exercise 4 Forecast diabetes progression
The goal of this exercice is to use Linear Regression to forecast the progression of diabetes. It will not always be precised, you should **ALWAYS** start doing an exploratory data analysis in order to have a good understanding of the data you model. As a reminder here an introduction to EDA:
The goal of this exercise is to use Linear Regression to forecast the progression of diabetes. It will not always be precised, you should **ALWAYS** start doing an exploratory data analysis in order to have a good understanding of the data you model. As a reminder here an introduction to EDA:
https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9
The data set used is described in https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.
@ -300,11 +300,11 @@ https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
4. This question is validated if the mse on the **train set** is `2888.326888` and the mse on the **test set** is `2858.255153`.
## Exercice 5 Gradient Descent
## Exercise 5 Gradient Descent
The goal of this exercice is to understand how the Linear Regression algorithm finds the optimal coefficients.
The goal of this exercise is to understand how the Linear Regression algorithm finds the optimal coefficients.
The goal is to fit a Linear Regression on a one dimensional features data **without using Scikit-learn**. Let's use the data set we generated for the exercice 1:
The goal is to fit a Linear Regression on a one dimensional features data **without using Scikit-learn**. Let's use the data set we generated for the exercise 1:
```

28
one_md_per_day_format/piscine/Week2/day2.md

@ -31,8 +31,8 @@ More details:
https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102
For the linear regression exercices, the loss (Mean Square Error - MSE) is minimized with an algorithm called **gradient descent**. In classification, the MSE loss can't be used because the output of the model is 0 or 1 (for binary classification).
The **logloss** or **cross entropy** is the loss used for classification. Similarly, it has some nice mathematical properties. The minimization of the **logloss** is not covered in the exercices. However, since it is used in most machine learning models for classification, I recommend spending some time reading the related article. This article gives a nice example of how it works:
For the linear regression exercises, the loss (Mean Square Error - MSE) is minimized with an algorithm called **gradient descent**. In classification, the MSE loss can't be used because the output of the model is 0 or 1 (for binary classification).
The **logloss** or **cross entropy** is the loss used for classification. Similarly, it has some nice mathematical properties. The minimization of the **logloss** is not covered in the exercises. However, since it is used in most machine learning models for classification, I recommend spending some time reading the related article. This article gives a nice example of how it works:
https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
@ -48,9 +48,9 @@ https://medium.com/swlh/what-is-logistic-regression-62807de62efa
# Exercice 1 Logistic regression in Scikit-learn
# Exercise 1 Logistic regression in Scikit-learn
The goal of this exercice is to learn to use Scikit-learn to classify data.
The goal of this exercise is to learn to use Scikit-learn to classify data.
```
X = [[0],[0.1],[0.2], [1],[1.1],[1.2], [1.3]]
y = [0,0,0,1,1,1,0]
@ -93,9 +93,9 @@ Score:
```
# Exercice 2 Sigmoid
# Exercise 2 Sigmoid
The goal of this exercice is to learn to compute and plot the sigmoid function.
The goal of this exercise is to learn to compute and plot the sigmoid function.
1. On the same plot, plot the sigmoid function and the custom sigmoids defined as:
```
@ -121,9 +121,9 @@ The plot should look like this:
# Exercice 3 Decision boundary
# Exercise 3 Decision boundary
The goal of this exercice is to learn to fit a logistic regression on simple examples and to understand how the algorithm separated the data from the different classes.
The goal of this exercise is to learn to fit a logistic regression on simple examples and to understand how the algorithm separated the data from the different classes.
## 1 dimension
@ -304,9 +304,9 @@ As mentioned, it is not required to shift the class prediction to make the plot
# Exercice 4: Train test split
# Exercise 4: Train test split
The goal of this exercice is to learn to split a classification data set. The idea is the same as splitting a regression data set but there's one important detail specific to the classification: the proportion of each class in the train set and test set.
The goal of this exercise is to learn to split a classification data set. The idea is the same as splitting a regression data set but there's one important detail specific to the classification: the proportion of each class in the train set and test set.
@ -358,9 +358,9 @@ The proportion of class `1` is **0.125** in the train set and **1.** in the test
2. This question is validated if the proportion of class `1` is **0.3** for both sets.
# Exercice 5 Breast Cancer prediction
# Exercise 5 Breast Cancer prediction
The goal of this exercice is to use Logistic Regression
The goal of this exercise is to use Logistic Regression
to predict breast cancer. It is always important to understand the data before training any Machine Learning algorithm. The data is described in **breast-cancer-wisconsin.names**. I suggest adding the column names to the DataFrame manually.
Preliminary:
@ -439,9 +439,9 @@ array([[90, 2],
As said, for some reason, you may have slightly different results because of the data splitting. However, the values you have in the confusion matrix should be close to these results.
# Exercice 6 Multi-class (Optional)
# Exercise 6 Multi-class (Optional)
The goal of this exercice is to learn to train a classification algorithm on multi-class labelled data.
The goal of this exercise is to learn to train a classification algorithm on multi-class labelled data.
Some algorithms, such as SVM or Logistic Regression, do not natively support multi-class data (more than 2 classes). There are some approaches that allow these algorithms to be used on multi-class data.
Let's assume we work with 3 classes: A, B and C.
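One widely used approach of this kind is one-vs-rest; it is named here as an assumption, since this excerpt does not show which approaches the exercise goes on to describe. The idea is to turn a 3-class problem into three binary problems, one per class. A minimal sketch of the label transformation, with a hypothetical helper `one_vs_rest_labels` and made-up labels:

```python
def one_vs_rest_labels(labels):
    """Build one binary target vector per class: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if label == c else 0 for label in labels] for c in classes}

# Hypothetical labels for the three classes A, B and C.
y = ["A", "B", "C", "A", "C"]
binary_targets = one_vs_rest_labels(y)

for class_name, target in binary_targets.items():
    # Each target could be used to train one binary classifier.
    print(class_name, target)
```

At prediction time, the class whose binary classifier returns the highest score is usually the one selected.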

28
one_md_per_day_format/piscine/Week2/day4.md

@ -36,9 +36,9 @@ https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-model
https://scikit-learn.org/stable/modules/model_evaluation.html
# Exercice 1 MSE Scikit-learn
# Exercise 1 MSE Scikit-learn
The goal of this exercice is to learn to use `sklearn.metrics` to compute the mean squared error (MSE).
The goal of this exercise is to learn to use `sklearn.metrics` to compute the mean squared error (MSE).
1. Compute the MSE using `sklearn.metrics` on `y_true` and `y_pred` below:
@ -51,10 +51,10 @@ y_pred = [90, 48, 2, 2, -4]
1. This question is validated if the MSE outputted is **2.25**.
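As a rough illustration of what the metric computes, here is a minimal pure-Python sketch of the mean squared error. The function name `mse` and the sample values are hypothetical and are not the exercise's data; in the exercise itself the value is expected to come from `sklearn.metrics`.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values, chosen only to illustrate the computation.
example_true = [3.0, 5.0, 2.0, 7.0]
example_pred = [2.0, 5.0, 4.0, 8.0]
print(mse(example_true, example_pred))  # (1 + 0 + 4 + 1) / 4 = 1.5
```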
# Exercice 2 Accuracy Scikit-learn
# Exercise 2 Accuracy Scikit-learn
The goal of this exercice is to learn to use `sklearn.metrics` to compute the accuracy.
The goal of this exercise is to learn to use `sklearn.metrics` to compute the accuracy.
1. Compute the accuracy using `sklearn.metrics` on `y_true` and `y_pred` below:
@ -68,9 +68,9 @@ y_true = [0, 0, 1, 1, 1, 1, 0]
# Exercice 3 Regression
# Exercise 3 Regression
The goal of this exercice is to learn to evaluate a machine learning model using many regression metrics.
The goal of this exercise is to learn to evaluate a machine learning model using many regression metrics.
Preliminary:
@ -138,13 +138,13 @@ pipe.fit(X_train, y_train)
MSE on the test set: 0.5537420654727396
```
This result shows that the model has slightly better results on the train set than the test set. That's frequent since it is easier to get a better grade on an exam we studied for than on an exam that is different from what was prepared. However, the results are not good: r2 ~ 0.3. Fitting non-linear models such as the Random Forest on this data may improve the results. That's the goal of exercice 5.
This result shows that the model has slightly better results on the train set than the test set. That's frequent since it is easier to get a better grade on an exam we studied for than on an exam that is different from what was prepared. However, the results are not good: r2 ~ 0.3. Fitting non-linear models such as the Random Forest on this data may improve the results. That's the goal of exercise 5.
# Exercice 4 Classification
# Exercise 4 Classification
The goal of this exercice is to learn to evaluate a machine learning model using many classification metrics.
The goal of this exercise is to learn to evaluate a machine learning model using many classification metrics.
Preliminary:
@ -232,9 +232,9 @@ Having a 99% ROC AUC is not usual. The data set we used is easy to classify. On
# Exercice 5 Machine Learning models
# Exercise 5 Machine Learning models
The goal of this exercice is to have an overview of the existing Machine Learning models and to learn to call them from scikit learn.
The goal of this exercise is to have an overview of the existing Machine Learning models and to learn to call them from scikit learn.
We will focus on:
- SVM/ SVC
@ -363,9 +363,9 @@ Take time to have basic understanding of the role of the basic hyperparameters a
It is important to notice that the Decision Tree overfits very easily. It learns easily the training data but is not able to extrapolate on the test set. This algorithm is not used a lot.
However, Random Forest and Gradient Boosting propose a solid approach to correct the overfitting (in that case the parameter `max_depth` is set to None, which is why the Random Forest overfits the data). These two algorithms are used intensively in Machine Learning projects.
# Exercice 6 Grid Search
# Exercise 6 Grid Search
The goal of this exercice is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameters, which are the parameters of the model, impact the performance of the model.
The goal of this exercise is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameters, which are the parameters of the model, impact the performance of the model.
The scikit learn object that runs the Grid Search is called GridSearchCV. We will learn tomorrow about the cross validation. For now, let us set the parameter **cv** to `[(np.arange(18576), np.arange(18576,20640))]`.
This means that GridSearchCV splits the data set in a train and test set.
@ -450,7 +450,7 @@ Ressources:
return gs.best_estimator_, gs.best_params_, gs.best_score_
```
In my case, the gridsearch parameters are not interesting. Even if I reduced the overfitting of the Random Forest, the score on the test set is lower than the score on the test set returned by the Gradient Boosting in the previous exercice without optimal parameters search.
In my case, the gridsearch parameters are not interesting. Even if I reduced the overfitting of the Random Forest, the score on the test set is lower than the score on the test set returned by the Gradient Boosting in the previous exercise without optimal parameters search.
3. This question is validated if the code used is:

10
one_md_per_day_format/piscine/Week2/template.md

@ -16,21 +16,21 @@
## Resources
# Exercice 1
# Exercise 1
# Exercice 2
# Exercise 2
# Exercice 3
# Exercise 3
# Exercice 4
# Exercise 4
# Exercice 5
# Exercise 5

10
one_md_per_day_format/piscine/Week3/template.md

@ -16,21 +16,21 @@
## Resources
# Exercice 1
# Exercise 1
# Exercice 2
# Exercise 2
# Exercice 3
# Exercise 3
# Exercice 4
# Exercise 4
# Exercice 5
# Exercise 5

18
one_md_per_day_format/piscine/Week3/w3day02.md

@ -10,7 +10,7 @@ The goal of this day is to learn to use Keras to build Neural Networks.
There are two ways to build Keras models: sequential and functional.
The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercices focus on the usage of the sequential API.
The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercises focus on the usage of the sequential API.
'2.4.3'
@ -25,9 +25,9 @@ A developper
## Resources
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
# Exercice 1 Sequential
# Exercise 1 Sequential
The goal of this exercice is to learn to call the object `Sequential`.
The goal of this exercise is to learn to call the object `Sequential`.
1. Put the object Sequential in a variable named `model` and print the variable `model`.
@ -39,9 +39,9 @@ The goal of this exercice is to learn to call the object `Sequential`.
# Exercice 2 Dense
# Exercise 2 Dense
The goal of this exercice is to learn to create layers of neurons. Keras proposes options to create custom layers. The neural networks built in these exercices do not require custom layers. `Dense` layers do the job. A dense layer is simply a layer where each unit or neuron is connected to each neuron in the next layer. As seen yesterday, there are three main types of layers: input, hidden and output. The **input layer** that specifies the number of inputs (features) is not represented as a layer in Keras. However, `Dense` has a parameter `input_dim` that gives the number of inputs in the previous layer. The output layer, like any hidden layer, can be created using `Dense`; the only difference is that the output layer contains a single neuron.
The goal of this exercise is to learn to create layers of neurons. Keras proposes options to create custom layers. The neural networks built in these exercises do not require custom layers. `Dense` layers do the job. A dense layer is simply a layer where each unit or neuron is connected to each neuron in the next layer. As seen yesterday, there are three main types of layers: input, hidden and output. The **input layer** that specifies the number of inputs (features) is not represented as a layer in Keras. However, `Dense` has a parameter `input_dim` that gives the number of inputs in the previous layer. The output layer, like any hidden layer, can be created using `Dense`; the only difference is that the output layer contains a single neuron.
1. Create a `Dense` layer with these parameters and return the output of `get_config`:
@ -121,9 +121,9 @@ The goal of this exercice is to learn to create layers of neurons. Keras propose
'bias_constraint': None}
```
# Exercice 3 Architecture
# Exercise 3 Architecture
The goal of this exercice is to combine the layers and to create a neural network.
The goal of this exercise is to combine the layers and to create a neural network.
1. Create a neural network for regression with the following architecture and return `print(model.summary())`:
@ -145,9 +145,9 @@ The goal of this exercice is to combine the layers and to create a neural networ
```
The first two layers could use another activation function than sigmoid (e.g. relu).
# Exercice 4 Optimize
# Exercise 4 Optimize
The goal of this exercice is to learn to train the neural network. Once the architecture of the neural network is set there are two steps to train the neural network:
The goal of this exercise is to learn to train the neural network. Once the architecture of the neural network is set there are two steps to train the neural network:
- `compile`: The compilation step aims to set the loss function, to choose the algorithm to minimize the chosen loss function and to choose the metric the model outputs.

22
one_md_per_day_format/piscine/Week3/w3day03.md

@ -24,9 +24,9 @@ A developper
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
# Exercice 1 Regression - Optimize
# Exercise 1 Regression - Optimize
The goal of this exercice is to learn to set up the optimization for a regression neural network. There's no code to run in that exercice. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network:
The goal of this exercise is to learn to set up the optimization for a regression neural network. There's no code to run in that exercise. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network:
```
model = keras.Sequential()
@ -68,9 +68,9 @@ https://keras.io/api/losses/regression_losses/
https://keras.io/api/metrics/regression_metrics/
# Exercice 2 Regression example
# Exercise 2 Regression example
The goal of this exercice is to learn to train a neural network to perform a regression on a data set.
The goal of this exercise is to learn to train a neural network to perform a regression on a data set.
The data set is the Auto MPG Dataset and the goal is to build a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.
https://www.tensorflow.org/tutorials/keras/regression
@ -150,9 +150,9 @@ The output neuron has to be `Dense(1)` - by defaut the activation funtion is lin
*Hint*: To get the score on the test set, `evaluate` could have been used: `model.evaluate(X_test_scaled, y_test)`.
# Exercice 3 Multi classification - Softmax
# Exercise 3 Multi classification - Softmax
The goal of this exercice is to learn to build a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the three class values require specialized handling. A multi-classification neural network uses as output layer a **softmax** layer. The **softmax** activation function is an extension of the sigmoid as it is designed to output the probabilities to belong to each class in a multi-class problem. This output layer has to contain as many neurons as there are classes in the multi-classification problem. This article explains in detail how it works: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
The goal of this exercise is to learn to build a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the three class values require specialized handling. A multi-classification neural network uses as output layer a **softmax** layer. The **softmax** activation function is an extension of the sigmoid as it is designed to output the probabilities to belong to each class in a multi-class problem. This output layer has to contain as many neurons as there are classes in the multi-classification problem. This article explains in detail how it works: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
Let us assume we want to classify images and we know they contain either apples, bears, candies, eggs or dogs (extension of the example in the link above).
@ -175,9 +175,9 @@ Let us assume we want to classify images and we know they contain either apples,
model.add(Dense(5, activation= 'softmax'))
```
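To make the role of the softmax output layer more concrete, here is a minimal pure-Python sketch of the function itself. The five input scores are hypothetical values standing in for the raw outputs of the five output neurons; this is not how Keras implements it internally, only the underlying formula.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for five classes (apple, bear, candy, egg, dog).
logits = [2.0, 1.0, 0.1, 0.5, 1.5]
probs = softmax(logits)
print(probs)
print(sum(probs))  # close to 1.0
```

The largest score receives the largest probability, and the predicted class is the one with the highest probability.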
# Exercice 4 Multi classification - Optimize
# Exercise 4 Multi classification - Optimize
The goal of this exercice is to learn to optimize a multi-classification neural network. As learnt previously, the loss function used in binary classification is the log loss - also called in Keras `binary_crossentropy`. This function is defined for binary classification and can be extended to multi-classification. In Keras, the extended loss that supports multi-classification is `categorical_crossentropy`. There's no code to run in that exercice.
The goal of this exercise is to learn to optimize a multi-classification neural network. As learnt previously, the loss function used in binary classification is the log loss - also called in Keras `binary_crossentropy`. This function is defined for binary classification and can be extended to multi-classification. In Keras, the extended loss that supports multi-classification is `categorical_crossentropy`. There's no code to run in that exercise.
1. Fill the chunk of code below in order to optimize the neural network defined in the previous exercise. Choose the adapted loss, adam as optimizer and the accuracy as metric.
@ -196,9 +196,9 @@ model.compile(loss='categorical_crossentropy',
metrics=['accuracy'])
```
# Exercice 5 Multi classification example
# Exercise 5 Multi classification example
The goal of this exercice is to learn to use a neural network to classify a multiclass data set. The data set used is the Iris data set, which allows classifying flowers given basic features such as the flower's measurements.
The goal of this exercise is to learn to use a neural network to classify a multiclass data set. The data set used is the Iris data set, which allows classifying flowers given basic features such as the flower's measurements.
Preliminary:
- Split train test. Keep 20% for the test set. Use `random_state=1`.
@ -245,6 +245,6 @@ model.fit(X_train_sc, y_train_multi_class, epochs = 1000, batch_size=20)
# Exercice 6 GridSearch
# Exercise 6 GridSearch
https://medium.com/@am.benatmane/keras-hyperparameter-tuning-using-sklearn-pipelines-grid-search-with-cross-validation-ccfc74b0ce9f

40
one_md_per_day_format/piscine/Week3/w3day04.md

@ -29,12 +29,12 @@ Les packages NLTK and Spacy to do the preprocessing
## Resources
# Exercice 1: Lowercase
# Exercise 1: Lowercase
The goal of this exercice is to learn to lowercase text data in Python. Note that if the volume of data is low the text data can be stored in a Pandas DataFrame or Series. But, when dealing with high volumes (high but not huge), using a Pandas DataFrame or Series is not efficient. Data structures such as dictionaries or lists are better suited.
The goal of this exercise is to learn to lowercase text data in Python. Note that if the volume of data is low the text data can be stored in a Pandas DataFrame or Series. But, when dealing with high volumes (high but not huge), using a Pandas DataFrame or Series is not efficient. Data structures such as dictionaries or lists are better suited.
```
list_ = ["This is my first NLP exercice", "wtf!!!!!"]
list_ = ["This is my first NLP exercise", "wtf!!!!!"]
series_data = pd.Series(list_, name='text')
```
@ -46,21 +46,21 @@ Note: Do not change the text manually !
1. This question is validated if the output is:
```
0 this is my first nlp exercice
0 this is my first nlp exercise
1 wtf!!!!!
Name: text, dtype: object
```
2. This question is validated if the output is:
```
0 THIS IS MY FIRST NLP EXERCICE
0 THIS IS MY FIRST NLP EXERCISE
1 WTF!!!!!
Name: text, dtype: object
```
# Exercise 2: Punctuation
The goal of this exercise is to learn to deal with punctuation. In Natural Language Processing, some basic approaches such as Bag of Words (exercice X) model the text as an unordered combination of words. In that case the punctuation is not always useful as it doesn't add information to the model. That is why it is removed.
The goal of this exercise is to learn to deal with punctuation. In Natural Language Processing, some basic approaches such as Bag of Words (exercise X) model the text as an unordered combination of words. In that case the punctuation is not always useful as it doesn't add information to the model. That is why it is removed.
1. Remove the punctuation from this sentence. All characters in !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ are considered as punctuation.
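The character set listed above matches Python's `string.punctuation`, so one possible approach is a translation table that deletes those characters. This is a minimal sketch, not necessarily the expected solution; the helper name `remove_punctuation` and the sample sentence are hypothetical and are not the exercise's text.

```python
import string

def remove_punctuation(text):
    """Delete every character listed in string.punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

# Hypothetical sentence, used only to illustrate the call.
sample = "Hello, world! Is this (really) a test?"
print(remove_punctuation(sample))  # Hello world Is this really a test
```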
@ -81,9 +81,9 @@ The goal of this exerice is to learn to deal with punctuation. In Natural Langua
```
# Exercice 3 Tokenization
# Exercise 3 Tokenization
The goal of this exercice is to learn to tokenize a text. This step is important because it splits the text into tokens. A token could be a sentence or a word.
The goal of this exercise is to learn to tokenize a text. This step is important because it splits the text into tokens. A token could be a sentence or a word.
```
text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software."""
@ -152,13 +152,13 @@ https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-p
```
# Exercice 4 Stop words
# Exercise 4 Stop words
The goal of this exercice is to learn to remove stop words with NLTK. Stop words usually refer to the most common words in a language. For example: "and", "is", "a" are stop words and do not add information to a sentence.
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refer to the most common words in a language. For example: "and", "is", "a" are stop words and do not add information to a sentence.
```
text = """
The goal of this exercice is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
"""
```
1. Remove stop words from this sentence and return the list of word tokens without stop words.
@ -168,13 +168,13 @@ The goal of this exercice is to learn to remove stop words with NLTK. Stop word
1. This question is validated if, using NLTK, the output is:
```
['The', 'goal', 'exercice', 'learn', 'remove', 'stop', 'words', 'NLTK', '.', 'Stop', 'words', 'usually', 'refers', 'common', 'words', 'language', '.']
['The', 'goal', 'exercise', 'learn', 'remove', 'stop', 'words', 'NLTK', '.', 'Stop', 'words', 'usually', 'refers', 'common', 'words', 'language', '.']
```
# Exercice 5 Stemming
# Exercise 5 Stemming
The goal of this exercice is to learn to use stemming using NLTK. As explained in detail in the article, stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
The goal of this exercise is to learn to use stemming using NLTK. As explained in detail in the article, stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
Note: The output of a stemmer is a word that may not exist in the dictionary.
@ -196,9 +196,9 @@ The interviewer interviews the president in an interview
```
# Exercice 6: Text preprocessing
# Exercise 6: Text preprocessing
The goal of this exercice is to learn to create a function to preprocess and clean a text using NLTK.
The goal of this exercise is to learn to create a function to preprocess and clean a text using NLTK.
Put this text in a variable:
@ -267,22 +267,22 @@ https://towardsdatascience.com/nlp-preprocessing-with-nltk-3c04ee00edc0
```
# Exercice 7: Bag of Word representation
# Exercise 7: Bag of Word representation
https://machinelearningmastery.com/gentle-introduction-bag-words-model/
The goal of this exercice is to understand how to create a Bag of Words (BoW) model on a corpus of texts. More precisely, we will create a labeled data set from textual data using a word count matrix.
The goal of this exercise is to understand how to create a Bag of Words (BoW) model on a corpus of texts. More precisely, we will create a labeled data set from textual data using a word count matrix.
As explained in the resource, the Bag of Words representation makes the assumption that the order in which the words appear in a text doesn't matter. There are different types of Bag of Words representations:
- Boolean: Each document is a boolean vector
- Wordcount: Each document is a word count vector
- TFIDF: Each document is a score vector. The score is detailed in the next exercice.
- TFIDF: Each document is a score vector. The score is detailed in the next exercise.
The data `tweets_train.txt` contains tweets labeled with a sentiment. It gives the positivity of a tweet.
Steps:
1. Preprocess the data using the function implemented in the previous exercice. Then, using `CountVectorizer` from scikit-learn with `max_features=500`, compute the word count of the tweets. The output is a sparse matrix.
1. Preprocess the data using the function implemented in the previous exercise. Then, using `CountVectorizer` from scikit-learn with `max_features=500`, compute the word count of the tweets. The output is a sparse matrix.
- Check the shape of the word count matrix
- Set **max_features** to 500 of the initial size of the dictionary.

24
one_md_per_day_format/piscine/Week3/w3day05.md

@ -19,9 +19,9 @@ There are many type of language models pre-trained in Spacy. Each has its specif
## Resources
# Exercice 1 Embedding 1
# Exercise 1 Embedding 1
The goal of this exercice is to learn to load an embedding on SpaCy.
The goal of this exercise is to learn to load an embedding on SpaCy.
1. Install and load `en_core_web_sm` embedding. Compute the embedding of `car`.
@ -40,10 +40,10 @@ array([ 1.0522802e+00, 1.4806499e+00, 7.7402556e-01, 1.0373484e+00,
```
# Exercice 2: Tokenization
# Exercise 2: Tokenization
The goal of this exercice is to learn to tokenize a document using Spacy. We did this using NLTK yesterday.
The goal of this exercise is to learn to tokenize a document using Spacy. We did this using NLTK yesterday.
1. Tokenize the text below and print the tokens
@ -68,9 +68,9 @@ The goal of this exercice is to learn to tokenize a document using Spacy. We did
.
```
## Exercice 3 Embeddings 2
## Exercise 3 Embeddings 2
The goal of this exercice is to learn to use SpaCy embedding on a document.
The goal of this exercise is to learn to use SpaCy embedding on a document.
1. Compute the embedding of all the words in this sentence. The language model considered is `en_core_web_md`
@ -106,9 +106,9 @@ https://medium.com/datadriveninvestor/cosine-similarity-cosine-distance-6571387f
[logo]: w3day05ex1_plot.png "Plot"
# Exercice 4 Sentences' similarity
# Exercise 4 Sentences' similarity
The goal of this exercise is to learn to compute the similarity between two sentences. As explained in the documentation: **The word embedding of a full sentence is simply the average over all different words**. This is how `similarity` works in SpaCy. This small use case is very interesting because if we build a corpus of sentences that express an intention such as **buy shoes**, then we can detect this intention and use it to propose shoe advertisements to customers. The language model used in this exercice is `en_core_web_sm`.
The goal of this exercise is to learn to compute the similarity between two sentences. As explained in the documentation: **The word embedding of a full sentence is simply the average over all different words**. This is how `similarity` works in SpaCy. This small use case is very interesting because if we build a corpus of sentences that express an intention such as **buy shoes**, then we can detect this intention and use it to propose shoe advertisements to customers. The language model used in this exercise is `en_core_web_sm`.
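The averaging idea can be sketched without SpaCy. The example below averages word vectors into a sentence vector and compares two sentences with cosine similarity, which is assumed here to be the measure in use. The 3-dimensional vectors are made up for illustration; real embeddings have many more dimensions, and the helper names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors: one inner list per word of each sentence.
sentence_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
sentence_b = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

sim = cosine_similarity(mean_vector(sentence_a), mean_vector(sentence_b))
print(sim)
```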
1. Compute the similarities (3 in total) between these sentences:
@ -135,9 +135,9 @@ The goal of this exerice is to learn to compute the similarity between two sente
# Exercice 5: NER
# Exercise 5: NER
The goal of this exercice is to learn to use a Named entity recognition algorithm to detect entities.
The goal of this exercise is to learn to use a Named entity recognition algorithm to detect entities.
```
Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. It is considered one of the Big Five companies in the U.S. information technology industry, along with Amazon, Google, Microsoft, and Facebook.
@ -189,9 +189,9 @@ https://en.wikipedia.org/wiki/Named-entity_recognition
```
# Exercice 6 Part-of-speech tags
# Exercise 6 Part-of-speech tags
The goal of this exercice is to learn to use the Part-of-speech tags (**POS TAG**) using Spacy. As explained in wikipedia, the POS TAG is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
The goal of this exercise is to learn to use the Part-of-speech tags (**POS TAG**) using Spacy. As explained in wikipedia, the POS TAG is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
Example

20
one_md_per_day_format/piscine/Week3/w3day1.md

@ -23,9 +23,9 @@ https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-funct
Reproduce this article without backprop
https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9
# Exercice 1 The neuron
# Exercise 1 The neuron
The goal of this exercice is to understand the role of a neuron and to implement a neuron.
The goal of this exercise is to understand the role of a neuron and to implement a neuron.
An artificial neuron, the basic unit of the neural network, (also referred to as a perceptron) is a mathematical function. It takes one or more inputs that are multiplied by values called “weights” and added together. This value is then passed to a non-linear function, known as an activation function, to become the neuron’s output.
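The description above can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions: the sigmoid is used as the activation function, a bias term is added to the weighted sum, and the inputs, weights and bias are hypothetical values rather than the ones used in the exercise.

```python
import math

def sigmoid(x):
    """Non-linear activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias=0.0):
    """Multiply each input by its weight, add everything up, then activate."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)

# Hypothetical inputs, weights and bias (assumptions for illustration).
out = neuron([2.0, 3.0], [0.0, 1.0], bias=4.0)
print(out)
```

Swapping `sigmoid` for another activation function changes only the last step; the weighted sum stays the same.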
@ -91,7 +91,7 @@ https://victorzhou.com/blog/intro-to-neural-networks/
# Exercise 2 Neural network
The goal of this exercice is to understand how to combine three neurons to form a neural network. A neural network is nothing more than neurons connected together. As shown in the figure the neural network is composed of **layers**:
The goal of this exercise is to understand how to combine three neurons to form a neural network. A neural network is nothing more than neurons connected together. As shown in the figure the neural network is composed of **layers**:
- Input layer: it only represents input data. **It doesn't contain neurons**.
- Output layer: it represents the last layer. It contains a neuron (in some cases more than 1).
@ -99,7 +99,7 @@ The goal of this exercice is to understand how to combine three neurons to form
Notice that the neuron **o1** in the output layer takes as input the output of the neurons **h1** and **h2** in the hidden layer.
In exercice 1, you implemented this neuron.
In exercise 1, you implemented this neuron.
![alt text][neuron]
[neuron]: images/day1/ex2/w3_day1_neuron.png "Plot"
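As a rough sketch of how the three neurons fit together, the snippet below wires two hidden neurons (**h1**, **h2**) into one output neuron (**o1**). The helper names `neuron_output` and `network_output`, and all weights, biases and inputs, are hypothetical values picked for illustration; they are not the ones used to validate the exercise.

```python
import math


def sigmoid(x):
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(weights, bias, inputs):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def network_output(inputs, hidden_params, output_params):
    """Forward pass: input layer -> hidden layer (h1, h2) -> output neuron o1."""
    # The input layer holds no neurons; it only passes the data along.
    hidden = [neuron_output(w, b, inputs) for w, b in hidden_params]
    # o1 takes the outputs of h1 and h2 as its own inputs.
    out_weights, out_bias = output_params
    return neuron_output(out_weights, out_bias, hidden)


# Hypothetical parameters: every neuron uses weights [0, 1] and bias 0.
hidden_params = [([0.0, 1.0], 0.0), ([0.0, 1.0], 0.0)]
output_params = ([0.0, 1.0], 0.0)
prediction = network_output([2.0, 3.0], hidden_params, output_params)
print(prediction)
```

With these values both hidden neurons output `sigmoid(3)`, and **o1** then outputs `sigmoid(sigmoid(3))`.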
@ -143,9 +143,9 @@ Now, we add two more neurons:
1. This question is validated if the output is: **0.9524917424084265**
# Exercice 3 Log loss
# Exercise 3 Log loss
The goal of this exercice is to implement the Log loss function. As mentioned last week, this function is used in classification as a **loss function**. It means that the better the classifier is, the smaller the loss function is. In W2D1, you implemented the gradient descent on the MSE loss to update the weights of the linear regression. Similarly, the minimization of the Log loss leads to finding optimal weights.
The goal of this exercise is to implement the Log loss function. As mentioned last week, this function is used in classification as a **loss function**. It means that the better the classifier is, the smaller the loss function is. In W2D1, you implemented the gradient descent on the MSE loss to update the weights of the linear regression. Similarly, the minimization of the Log loss leads to finding optimal weights.
Log loss: - 1/n * Sum[(y_true*log(y_pred) + (1-y_true)*log(1-y_pred))]
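The formula above translates almost term by term into Python. The sketch below is one possible implementation, not the expected solution; the function name, the `eps` clipping parameter (added here only to avoid `log(0)`), and the sample values are assumptions.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss: -1/n * sum(y*log(p) + (1-y)*log(1-p))."""
    n = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Assumption: clip probabilities away from 0 and 1 so log() is defined.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / n


# Hypothetical labels and predicted probabilities.
confident = log_loss([1, 0], [0.9, 0.1])
hesitant = log_loss([1, 0], [0.6, 0.4])
print(confident, hesitant)
```

The more confident (and correct) predictions give the smaller loss, which matches the statement that a better classifier has a smaller loss.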
@ -163,7 +163,7 @@ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
1. This question is validated if the output is: **0.5472899351247816**.
# Exercice 4 Forward propagation
# Exercise 4 Forward propagation
The goal of this exercise is to compute the log loss on the output of the forward propagation. The data used is the tiny data set below.
@ -198,9 +198,9 @@ The goal if the network is to predict the success at the exam given math and che
2. This question is validated if the logloss for the 4 students is **0.5485133607757963**.
# Exercice 5 Regression
# Exercise 5 Regression
The goal of this exercice is to learn to adapt the output layer to regression.
The goal of this exercise is to learn to adapt the output layer to regression.
As a reminder, one of the reasons the sigmoid is used in classification is that it squashes the output between 0 and 1, which is the expected output range for a probability (W2D2: Logistic regression). However, the output of a regression is not a probability.
In order to perform a regression using a neural network, the activation function of the neuron in the output layer has to be changed to the **identity function**. In mathematics, the identity function is: **f(x) = x**. In other words, it returns its input unchanged. The three steps become:
@ -218,7 +218,7 @@ In order to perform a regression using a neural network, the activation function
All other neurons' activation function **doesn't change**.
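One way to sketch this change is shown below. The `regression` parameter and the `feedforward` method come from the exercise text; the `Neuron` class layout, the `bias` term and the example values are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class Neuron:
    """Neuron whose activation is the sigmoid, or the identity for regression."""

    def __init__(self, weights, bias, regression=False):
        self.weights = weights
        self.bias = bias  # assumption: a bias is added to the weighted sum
        self.regression = regression

    def feedforward(self, inputs):
        total = sum(w * x for w, x in zip(self.weights, inputs)) + self.bias
        if self.regression:
            # Identity function f(x) = x: return the weighted sum unchanged.
            return total
        return sigmoid(total)


# Hypothetical values: the weighted sum is 0*2 + 1*3 + 4 = 7.
regression_output = Neuron([0.0, 1.0], 4.0, regression=True).feedforward([2.0, 3.0])
classification_output = Neuron([0.0, 1.0], 4.0).feedforward([2.0, 3.0])
print(regression_output, classification_output)
```

Only the output neuron would be created with `regression=True`; hidden neurons keep the default sigmoid.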
1. Adapt the neuron class implemented in exercice 1. It now takes as a parameter `regression` which is boolean. When its value is `True`, `feedforward` should use the identity function as activation function instead of the sigmoid function.
1. Adapt the neuron class implemented in exercise 1. It now takes as a parameter `regression` which is boolean. When its value is `True`, `feedforward` should use the identity function as activation function instead of the sigmoid function.
```
