
Merge pull request #1 from 01-edu/day01-test

day1: testing and feedback
brad-gh 3 years ago committed by GitHub
Parent commit: 5ea3de8de3
Changed files (20):
- one_md_per_day_format/piscine/Week1/data/D01/ex8/winequality-red.csv (1600)
- one_md_per_day_format/piscine/Week1/data/D01/ex8/winequality.names (72)
- one_md_per_day_format/piscine/Week1/data/D01/ex9/model_forecasts.txt (0)
- one_md_per_day_format/piscine/Week1/day1.md (321)
- one_md_per_day_format/piscine/Week1/day2.md (24)
- one_md_per_day_format/piscine/Week1/day3.md (24)
- one_md_per_day_format/piscine/Week1/day4.md (26)
- one_md_per_day_format/piscine/Week1/day5.md (20)
- one_md_per_day_format/piscine/Week2/day03.md (32)
- one_md_per_day_format/piscine/Week2/day05.md (22)
- one_md_per_day_format/piscine/Week2/day1.md (22)
- one_md_per_day_format/piscine/Week2/day2.md (28)
- one_md_per_day_format/piscine/Week2/day4.md (28)
- one_md_per_day_format/piscine/Week2/template.md (10)
- one_md_per_day_format/piscine/Week3/template.md (10)
- one_md_per_day_format/piscine/Week3/w3day02.md (18)
- one_md_per_day_format/piscine/Week3/w3day03.md (22)
- one_md_per_day_format/piscine/Week3/w3day04.md (40)
- one_md_per_day_format/piscine/Week3/w3day05.md (24)
- one_md_per_day_format/piscine/Week3/w3day1.md (20)

1600
one_md_per_day_format/piscine/Week1/data/D01/ex8/winequality-red.csv

File diff suppressed because it is too large.

72
one_md_per_day_format/piscine/Week1/data/D01/ex8/winequality.names

@@ -0,0 +1,72 @@
Citation Request:
This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
1. Title: Wine Quality
2. Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
3. Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. pH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
these datasets under a regression approach. The support vector machine model achieved the
best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
analysis procedure).
4. Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are much more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.
5. Number of Instances: red wine - 1599; white wine - 4898.
6. Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
feature selection.
7. Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
8. Missing Attribute Values: None

0
one_md_per_day_format/piscine/Week1/data/D01/ex6/model_forecasts.txt → one_md_per_day_format/piscine/Week1/data/D01/ex9/model_forecasts.txt

321
one_md_per_day_format/piscine/Week1/day1.md

@@ -1,38 +1,43 @@
# D01 Piscine AI - Data Science
The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
Version of NumPy I used to do the exercises: 1.18.1
I suggest using the most recent one.
Author:
<div style="page-break-after: always"></div>
# Outline: (optional)
A. Introduction
B. Rules
C. Exercises
## Rules
... Notebooks: Google Colab or Jupyter Notebook
Save one notebook per day or one per exercise. Use markdown to divide your notebook into different exercises.
## Resources
- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/
# Exercise 1 Your first NumPy array
The goal of this exercise is to use many Python data types in **NumPy** arrays. **NumPy** arrays are used intensively in **NumPy** and **Pandas**. They are flexible and allow you to use optimized underlying **NumPy** functions.
1. Create a NumPy array that contains: an integer, a float, a string, a dictionary, a list, a tuple, a set and a boolean.
The expected output is:
```python
for i in your_np_array:
    print(type(i))
@@ -44,13 +49,13 @@ for i in your_np_array:
<class 'tuple'>
<class 'set'>
<class 'bool'>
```
## Correction
1. This question is validated if `your_numpy_array` is a NumPy array. This can be checked with `type(your_numpy_array)`, which should be equal to `numpy.ndarray`, and if the types of its elements are as follows:
```python
for i in your_np_array:
    print(type(i))
@@ -62,27 +67,30 @@ for i in your_np_array:
<class 'tuple'>
<class 'set'>
<class 'bool'>
```
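A minimal sketch of one accepted answer, assuming the mixed Python objects are stored with `dtype=object`:
```python
import numpy as np

# Mixed Python objects require dtype=object; NumPy keeps each element as-is.
your_np_array = np.array(
    [1, 1.5, "text", {"a": 1}, [1, 2], (1, 2), {1, 2}, True], dtype=object
)
for i in your_np_array:
    print(type(i))
```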
---
# Exercise 2 Zeros
The goal of this exercise is to learn to create a NumPy array filled with 0s.
1. Create a NumPy array of dimension **300** with zeros without filling it manually
2. Reshape it to **(3,100)**
## Correction
1. The question is validated if the solution uses `np.zeros` and if the shape of the array is `(300,)`
2. The question is validated if the solution uses `reshape` and the shape of the array is `(3, 100)`
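A minimal sketch of both steps:
```python
import numpy as np

zeros = np.zeros(300)             # shape (300,)
reshaped = zeros.reshape(3, 100)  # shape (3, 100)
print(zeros.shape, reshaped.shape)
```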
---
# Exercise 3 Slicing
The goal of this exercise is to learn NumPy indexing/slicing. It allows you to access values of a NumPy array efficiently, without a for loop.
1. Create a NumPy array of dimension 1 that contains all integers from 1 to 100, ordered.
2. Without using a for loop and using the array created in Q1, create an array that contains all odd integers. The expected output is: `np.array([1,3,...,99])`. *Hint*: it takes one line
3. Without using a for loop and using the array created in Q1, create an array that contains all even integers reversed. The expected output is: `np.array([100,98,...,2])`. *Hint*: it takes one line
@@ -90,47 +98,50 @@ The goal of this exercise is to learn NumPy indexing/slicing. It allows to acces
## Correction
1. This question is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100, and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`.
2. This question is validated if the solution is: `integers[::2]`
3. This question is validated if the solution is: `integers[::-2]`
4. This question is validated if the array is: `np.array([1,0,3,4,0,...,0,99,100])`. There are at least two ways to get this result without a for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array:
```python
mask = (integers+1)%3 == 0
integers[mask] = 0
```
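A minimal sketch of the four answers in order:
```python
import numpy as np

integers = np.arange(1, 101)     # Q1: [1, ..., 100]
odds = integers[::2]             # Q2: [1, 3, ..., 99]
evens_reversed = integers[::-2]  # Q3: [100, 98, ..., 2]
integers[1::3] = 0               # Q4: values 2, 5, 8, ... replaced by 0
print(integers[:6])              # [1 0 3 4 0 6]
```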
---
# Exercise 4 Random
The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons: lack of real data, creating a random benchmark, using varied data sets.
NumPy proposes a lot of options to generate random data. In statistics, assumptions are made about the distribution the data comes from. All data distributions that can be generated randomly are described in the documentation. In this exercise we will focus on two distributions:
- Uniform: For example, if your goal is to generate a random number from 1 to 100 with the same probability for every number, you'll need the uniform distribution. NumPy provides `randint` and `uniform` to generate uniform distributions
- Normal: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls** it can be done using the normal distribution. In that case, we need two parameters: the mean (1m51) and the standard deviation (0.0741m). NumPy provides `randn` to generate normal distributions (among others)
https://numpy.org/doc/stable/reference/random/generator.html
1. Set the seed to 888
2. Generate a **one-dimensional** array of size 100 with a normal distribution
3. Generate a **two-dimensional** array of size 8,8 with random integers from 1 to 10 - both included (same probability for each integer)
4. Generate a **three-dimensional** array of size 4,2,5 with random integers from 1 to 17 - both included (same probability for each integer)
## Correction
For this exercise, as the results may change depending on the package version or the OS, I give the code to correct the exercise. If the code is correct but the output is not the same as mine, it is accepted.
1. The solution is accepted if the solution is: `np.random.seed(888)`
2. The solution is accepted if the solution is `np.random.randn(100)`. The value of the first element is `0.17620087373662233`.
3. The solution is accepted if the solution is `np.random.randint(1,11,(8,8))`.
```console
Given the NumPy version and the seed, you should have this output:
array([[ 7, 4, 8, 10, 2, 1, 1, 10],
@@ -141,10 +152,11 @@ For this exercise, as the results may change depending on the version of the pac
[ 4, 1, 9, 7, 1, 4, 3, 5],
[ 3, 2, 10, 8, 6, 3, 9, 4],
[ 4, 4, 9, 2, 8, 5, 9, 5]])
```
4. The solution is accepted if the solution is `np.random.randint(1,18,(4,2,5))`.
```console
Given the NumPy version and the seed, you should have this output:
array([[[14, 16, 8, 15, 14],
@@ -158,52 +170,58 @@ For this exercise, as the results may change depending on the version of the pac
[[ 3, 10, 5, 16, 13],
[17, 12, 9, 7, 16]]])
```
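A minimal sketch gathering the four accepted answers (exact values depend on the NumPy version):
```python
import numpy as np

np.random.seed(888)                               # Q1: fix the seed
normal_1d = np.random.randn(100)                  # Q2: 100 draws from N(0, 1)
uniform_2d = np.random.randint(1, 11, (8, 8))     # Q3: integers 1..10, both included
uniform_3d = np.random.randint(1, 18, (4, 2, 5))  # Q4: integers 1..17, both included
print(normal_1d[0])
```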
---
# Exercise 5: Split, concatenate, reshape arrays
The goal of this exercise is to learn to concatenate and reshape arrays.
1. Generate an array with integers from 1 to 50: `array([1,...,50])`
2. Generate an array with integers from 51 to 100: `array([51,...,100])`
3. Using `np.concatenate`, concatenate the two arrays into: `array([1,...,100])`
4. Reshape the previous array into:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
```
## Correction
1. This question is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 50 is part of the array.
2. This question is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 100 is part of the array.
3. This question is validated if you concatenated the arrays this way: `np.concatenate((array1, array2))`.
4. This question is validated if the result is:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
```
The easiest way is to use `array.reshape(10,10)`.
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of NumPy Arrays)
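A minimal sketch of the four steps:
```python
import numpy as np

array1 = np.arange(1, 51)                    # [1, ..., 50]
array2 = np.arange(51, 101)                  # [51, ..., 100]
combined = np.concatenate((array1, array2))  # [1, ..., 100]
grid = combined.reshape(10, 10)              # 10 rows of 10 values
print(grid[0], grid[-1])
```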
---
# Exercise 6: Broadcasting and Slicing
The goal of this exercise is to learn to access values of n-dimensional arrays efficiently.
1. Create a 2-dimensional array of size 9,9 filled with 1s. Each value has to be an `int8`.
2. Using **slicing**, output this array:
```python
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
@@ -215,16 +233,16 @@ The goal of this exercise is to learn to access values of n-dimensional arrays a
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: Computation on Arrays: Broadcasting)
## Correction
1. The question is validated if the output is the same as:
`np.ones([9,9], dtype=np.int8)`
2. The question is validated if the output is:
```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
@@ -235,96 +253,108 @@ https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
The solution is not accepted if the values of the array have been changed one by one manually. The usage of a for loop is not allowed either.
Here is an example of a possible solution:
```python
x = np.ones((9, 9), dtype=np.int8)  # the array created in question 1
x[1:8,1:8] = 0
x[2:7,2:7] = 1
x[3:6,3:6] = 0
x[4,4] = 1
```
---
# Exercise 7: NaN
The goal of this exercise is to learn to deal with missing data in NumPy and to manipulate NumPy arrays.
Let us consider a 2-dimensional array that contains the grades of the past two exams. Some of the students missed the first exam. As the grade is missing, it has been replaced with a `NaN`.
1. Using `np.where`, create a third column that is equal to the grade of the first exam if it exists and the second one otherwise. Add the column as the third column of the array.
**Using a for loop or if/else statement is not allowed in this exercise.**
```python
import numpy as np
generator = np.random.default_rng(123)
grades = np.round(generator.uniform(low = 0.0, high = 10.0, size = (10, 2)))
grades[[1,2,5,7], [0,0,0,0]] = np.nan
print(grades)
```
## Correction
1. There are two steps in this exercise:
- Create the vector that contains the grade of the first exam if available, or the second one otherwise. This can be done using `np.where`:
```python
new_vector = np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
```
- Add this vector as the third column of the array. Here are two ways:
```python
np.insert(arr = grades, values = new_vector, axis = 1, obj = 2)
np.hstack((grades, new_vector[:, None]))
```
This question is validated if, without having used a for loop or having filled the array manually, the output is:
```console
[[ 7. 1. 7.]
[nan 2. 2.]
[nan 8. 8.]
[ 9. 3. 9.]
[ 8. 9. 8.]
[nan 2. 2.]
[ 8. 2. 8.]
[nan 6. 6.]
[ 9. 2. 9.]
[ 8. 5. 8.]]
```
https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-NumPy-arrays.html
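Putting the setup and the correction together, a minimal end-to-end sketch:
```python
import numpy as np

generator = np.random.default_rng(123)
grades = np.round(generator.uniform(low=0.0, high=10.0, size=(10, 2)))
grades[[1, 2, 5, 7], [0, 0, 0, 0]] = np.nan

# Take the first exam's grade when present, otherwise the second one
new_vector = np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
print(np.insert(arr=grades, values=new_vector, axis=1, obj=2))
```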
---
# Exercise 8: Wine
The goal of this exercise is to learn to perform a basic data analysis on real data using NumPy.
The data set that will be used for this exercise is the red wine data set.
https://archive.ics.uci.edu/ml/datasets/wine+quality
How to tell if a given 2D array has null columns?
1. Using `genfromtxt`, load the data and reduce the size of the NumPy array by optimizing the types. The sum of absolute differences between the original data set and the "memory" optimized one has to be smaller than `1e-3`. I suggest using `np.float32`. Check that the NumPy array weighs **76800 bytes**.
2. Print the 2nd, 7th and 12th rows as a two-dimensional array
3. Is there any wine with a percentage of alcohol greater than 20%? Return True or False
4. What is the average percentage of alcohol for all wines in the data set? If needed, drop `np.nan` values
5. Compute the minimum, the maximum, the 25th percentile, the 50th percentile (median) and the 75th percentile of the pH
6. Compute the average quality of the wines having the 20% least sulphates
7. Compute the mean of all variables for the wines having the best quality. Same question for the wines having the worst quality
## Correction
1. This question is validated if the text file has successfully been loaded in a NumPy array with `genfromtxt('winequality-red.csv', delimiter=',')` and the reduced array weighs **76800 bytes**
2. This question is validated if the output is:
```python
array([[ 7.4 , 0.7 , 0. , 1.9 , 0.076 , 11. , 34. ,
0.9978, 3.51 , 0.56 , 9.4 , 5. ],
[ 7.4 , 0.66 , 0. , 1.8 , 0.075 , 13. , 40. ,
@@ -332,15 +362,16 @@ How to tell if a given 2D array has null columns?
[ 6.7 , 0.58 , 0.08 , 1.8 , 0.097 , 15. , 65. ,
0.9959, 3.28 , 0.54 , 9.2 , 5. ]])
```
This slicing gives the answer: `my_data[[1,6,11],:]`.
3. This question is validated if the answer is False. There are many ways to get the answer: find the maximum or check for values greater than 20.
4. This question is validated if the answer is 10.422983114446529.
5. This question is validated if the answers are:
```console
pH stats
25 percentile: 3.21
50 percentile: 3.31
@@ -349,58 +380,62 @@ How to tell if a given 2D array has null columns?
min: 2.74
max: 4.01
```
> *Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results, please use `percentile`.*
6. This question is validated if the answer is ~`5.2`. The first step is to get the 20th percentile of the column `sulphates`, then create a boolean array that contains `True` if the value is smaller than that percentile, then select these rows in the column `quality` and compute the `mean`.
7. This question is validated if the output for the best wines is:
```python
array([ 8.56666667, 0.42333333, 0.39111111, 2.57777778, 0.06844444,
13.27777778, 33.44444444, 0.99521222, 3.26722222, 0.76777778,
12.09444444, 8. ])
```
And the output for the bad wines is:
```python
array([ 8.36 , 0.8845 , 0.171 , 2.635 , 0.1225 , 11. ,
24.9 , 0.997464, 3.398 , 0.57 , 9.955 , 3. ])
```
This can be done in three steps: get the max, create a boolean mask that indicates the rows with max quality, use this mask to subset the rows with the best quality, and compute the mean on axis 0.
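A minimal sketch of questions 2 to 7, assuming the file loads with a header row and a comma delimiter, and using the column order of the attribute list above (pH is column 8, sulphates column 9, alcohol column 10, quality column 11):
```python
import numpy as np

data = np.genfromtxt('winequality-red.csv', delimiter=',', skip_header=1).astype(np.float32)

print(data[[1, 6, 11], :])                # Q2: 2nd, 7th and 12th rows
print(bool(np.nanmax(data[:, 10]) > 20))  # Q3: any wine above 20% alcohol?
print(np.nanmean(data[:, 10]))            # Q4: average % of alcohol
ph = data[:, 8]                           # Q5: pH statistics
print(np.nanmin(ph), np.nanpercentile(ph, [25, 50, 75]), np.nanmax(ph))
low_sulphates = data[:, 9] < np.nanpercentile(data[:, 9], 20)  # Q6
print(np.nanmean(data[low_sulphates, 11]))
best = data[:, 11] == np.nanmax(data[:, 11])                   # Q7
print(np.nanmean(data[best], axis=0))
```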
---
## Exercise 9 Football tournament
The goal of this exercise is to learn to use permutations and complex indexing.
A Football tournament is organized in your city. There are 10 teams and the director of the tournament wants you to create a first round as exciting as possible. To do so, you are allowed to choose the pairs. As a former data scientist, you implemented a model based on the teams' current season performance. This model predicts the score difference between two teams. You used this algorithm to predict the score difference for every possible pair.
The matrix returned is a 2-dimensional array that contains in (i,j) the score difference between team i and team j. The matrix is in `model_forecasts.txt`.
If a team wins 7-1, the match is obviously less exciting than a match where the winner wins 2-1.
The criterion that corresponds to **the pairs that will give the most interesting matches** is **the pairs that minimize the sum of squared differences**.
Using this output, what are the pairs that will give the most interesting matches?
The expected output is:
```console
[[m1_t1 m2_t1 m3_t1 m4_t1 m5_t1]
[m1_t2 m2_t2 m3_t2 m4_t2 m5_t2]]
```
- m1_t1 stands for match1_team1
- m1_t1 plays against m1_t2 ...
**The usage of a for loop is not allowed, you may need to use the library** `itertools` **to create permutations**
https://docs.python.org/3.9/library/itertools.html
## Correction
This exercise is validated if the output is:
```console
[[0 3 1 2 4]
[7 6 8 9 5]]
```
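A brute-force sketch of the idea, assuming `model_forecasts.txt` parses with `np.genfromtxt` into a 10x10 matrix. Scanning all 10! orderings is slow but keeps the scoring itself vectorized:
```python
import itertools
import numpy as np

scores = np.genfromtxt('model_forecasts.txt')  # assumed 10x10 score-difference matrix

def cost(p):
    pairs = np.reshape(p, (5, 2))  # pair consecutive teams of the permutation
    return float(np.nansum(scores[pairs[:, 0], pairs[:, 1]] ** 2))

best = min(itertools.permutations(range(10)), key=cost)
print(np.reshape(best, (5, 2)).T)  # row 0: m*_t1, row 1: m*_t2
```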

24
one_md_per_day_format/piscine/Week1/day2.md

@@ -17,7 +17,7 @@ Not only is the Pandas library a central component of the data science toolkit b
Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even if there are 40 pages.
The version of Pandas I used is '1.0.1'.
@@ -41,9 +41,9 @@ https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
# Exercise 1
The goal of this exercise is to learn to create basic Pandas objects.
1. Create a DataFrame as below this using two ways:
- From a NumPy array
@@ -82,9 +82,9 @@ and if the types of the first value of the columns are
```
# Exercise 2 **Electric power consumption**
The goal of this exercise is to learn to manipulate real data with Pandas.
The data set used is **Individual household electric power consumption**
1. Delete the columns `Time`, `Sub_metering_2` and `Sub_metering_3`
@@ -118,7 +118,7 @@ The data set used is **Individual household electric power consumption**
## Correction:
1. `del` works but it is not a solution I recommend. For this exercise it is accepted. It is expected to use `drop` with `axis=1`. `inplace=True` may be useful to avoid assigning the result to a variable.
2. The preferred solution is `set_index` with `inplace=True`. As long as the DataFrame returns the output below, the solution is accepted. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted.
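A minimal sketch of both steps, assuming the UCI file name and its `;` separator:
```python
import pandas as pd

df = pd.read_csv('household_power_consumption.txt', sep=';', na_values='?')
df.drop(['Time', 'Sub_metering_2', 'Sub_metering_3'], axis=1, inplace=True)
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df.set_index('Date', inplace=True)  # index dtype is now datetime64[ns]
```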
@@ -219,9 +219,9 @@ The data set used is **Individual household electric power consumption**
# Exercise 3: E-commerce purchases
The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided since exercise 2 should have given you a nice introduction.
The data set used is **E-commerce purchases**.
@@ -240,7 +240,7 @@ Questions:
12. What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)
## Correction
To validate this exercise, all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas.
1. How many rows and columns are there? **10000 entries**
@@ -303,9 +303,9 @@ To validate this exercise all answers should return the expected numerical valu
The preferred solution is based on the usage of `apply` with a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
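A minimal sketch, with a hypothetical `Email` column standing in for the real data set:
```python
import pandas as pd

df = pd.DataFrame({'Email': ['a@gmail.com', 'b@yahoo.com', 'c@gmail.com']})
print(df['Email'].apply(lambda email: email.split('@')[1]).value_counts().head(5))
```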
# Exercise 3 Handling missing values
The goal of this exercise is to learn to handle missing values. In the previous exercise we used the first technique: filtering out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small.
This article explains the different types of missing data and how they should be handled. https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
"
@@ -327,7 +327,7 @@ This article explains the different types of missing data and how they should be
## Correction
To validate the exercise, you should have done these two steps in that order:
- Convert the numerical columns to `float`
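The conversion snippet is truncated in this diff; a minimal sketch of the idea, with an invented frame:
```python
import pandas as pd

# Hypothetical numerical columns that arrived as strings
df = pd.DataFrame({'col1': ['1.5', '2.0', 'bad'], 'col2': ['3', '4', '5']})
df = df.apply(pd.to_numeric, errors='coerce').astype(float)  # bad values become NaN
print(df.dtypes)
```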

24
one_md_per_day_format/piscine/Week1/day3.md

@@ -33,9 +33,9 @@ https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimat
# Exercise 1 Pandas plot 1
The goal of this exercise is to learn to create plots using Pandas. Pandas' `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
Here is the data we will be using:
@@ -69,9 +69,9 @@ The plot has to contain:
[logo]: images/day03/w1day03_ex1_plot1.png "Bar plot ex1"
## Exercise 2: Pandas plot 2
The goal of this exercise is to learn to create plots using Pandas. Pandas' `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
@@ -108,7 +108,7 @@ You should also observe that the older people are the bigger the number of chil
## Exercise 3 Matplotlib 1
The goal of this plot is to learn to use Matplotlib to plot data. As you know, Matplotlib is the underlying library used by Pandas. It provides more options to plot custom visualizations. However, most of the plots we will create with Matplotlib can be reproduced with Pandas' `.plot()`.
@@ -145,7 +145,7 @@ The plot has to contain:
[logo_ex3]: images/day03/w1day03_ex3_plot1.png "Scatter plot ex3"
# Exercise 4 Matplotlib 2
The goal of this plot is to learn to use Matplotlib to plot different lines in the same plot on different axes using `twinx`. This is very useful to compare variables with different ranges.
Here is the data:
@@ -187,7 +187,7 @@ The plot has to contain:
https://matplotlib.org/gallery/api/two_scales.html
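A minimal `twinx` sketch with invented series in different ranges:
```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(50)
fig, ax1 = plt.subplots()
ax1.plot(x, x, color='tab:blue')              # left axis: values in [0, 50]
ax2 = ax1.twinx()                             # second y-axis, same x-axis
ax2.plot(x, np.exp(x / 10), color='tab:red')  # right axis: values up to ~150
plt.show()
```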
# Exercise 5 Matplotlib subplots
The goal of this exercise is to learn to use Matplotlib to create subplots.
1. Reproduce this plot using a **for loop**:
@@ -224,14 +224,14 @@ The plot has to contain:
Check that the plot has been created with a for loop.
# Exercise 6 Plotly 1
Plotly has evolved a lot in the previous years. It is important to **always check the documentation**.
Plotly comes with a high-level interface: Plotly Express. It helps build some complex plots easily. The lesson won't detail the complex examples. Plotly Express is quite interesting when using Pandas DataFrames because there are some built-in functions that leverage Pandas DataFrames.
The plot output by Plotly is interactive and can also be dynamic.
The goal of the exercise is to plot the price of a company. Its price is generated below.
```
returns = np.random.randn(50)
@@ -284,9 +284,9 @@ The plot has to contain:
[logo_ex6]: images/day03/w1day03_ex6_plot1.png "Time series ex6"
# Exercise 7 Plotly Box plots
The goal of this exercise is to learn to use Plotly to plot box plots. A box plot is a method for graphically depicting groups of numerical data through their quartiles and values such as min and max. It allows you to quickly compare some variables.
Let us generate 3 random arrays from a normal distribution, and for the second and third arrays add respectively 1 and 2 to shift the mean.
@@ -295,7 +295,7 @@ y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1 # shift mean
y2 = np.random.randn(50) + 2
```
1. Plot in the same figure 2 box plots as shown in the image. In this exercise the style is not important.
![alt text][logo_ex7]
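A minimal sketch of two box plots, using the same kind of shifted arrays:
```python
import numpy as np
import plotly.graph_objects as go

y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1  # shifted mean
fig = go.Figure()
fig.add_trace(go.Box(y=y0, name='y0'))
fig.add_trace(go.Box(y=y1, name='y1'))
fig.show()
```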

26
one_md_per_day_format/piscine/Week1/day4.md

@@ -25,9 +25,9 @@ https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-d
# Exercise 1 Concatenate
The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for Series.
Here are the two DataFrames to concatenate:
@@ -55,9 +55,9 @@ df2 = pd.DataFrame([['c', 1], ['d', 2]],
| 3 | d | 2 |
# Exercise 2 Merge
The goal of this exercise is to learn to merge DataFrames.
The logic of merging DataFrames in Pandas is quite similar to the one used in SQL.
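A minimal sketch of an SQL-style inner join; the column names are invented:
```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'Feature1': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'Feature2': ['x', 'y', 'z']})
print(pd.merge(left, right, on='id', how='inner'))  # like INNER JOIN ... ON id
```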
Here are the two DataFrames to merge:
@@ -125,9 +125,9 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' names.
## Exercise 3 Merge MultiIndex
The goal of this exercise is to learn to merge DataFrames with a MultiIndex.
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reason, the Data Engineer lost the last 15 days of alternative data.
1. Using `market_data` as the reference, merge `alternative_data` on `market_data`
@@ -182,9 +182,9 @@ One of the answers that returns the correct DataFrame is:
2. This question is validated if the number of missing in the DataFrame is equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
# Exercise 4 Groupby Apply
The goal of this exercise is to learn to group the data and apply a function on the groups.
The use case we will work on is computing
1. Create a function that uses `pandas.DataFrame.clip` and that replaces extreme values by a given percentile. The values that are greater than the upper 80% percentile are replaced by that percentile, and the values that are smaller than the lower 20% percentile are replaced by that percentile. This process, which corrects outliers, is called **winsorizing**; a sketch follows below.
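A minimal sketch of such a winsorizing function:
```python
import numpy as np
import pandas as pd

def winsorize(s: pd.Series) -> pd.Series:
    # Clip values outside the 20th/80th percentiles of the series
    lower, upper = np.percentile(s, [20, 80])
    return s.clip(lower=lower, upper=upper)

print(winsorize(pd.Series(range(1, 11))))
```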
@@ -251,7 +251,7 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
## Correction
The for loop is forbidden in this exercise. The goal is to use `groupby` and `apply`.
1. This question is validated if the output is:
@@ -315,9 +315,9 @@ https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pa
# Exercise 5 Groupby Agg
The goal of this exercise is to learn to compute different types of aggregations on the groups. This small DataFrame contains products and prices.
| | value | product |
|---:|--------:|:-------------|
@@ -353,9 +353,9 @@ Note: The columns don't have to be MultiIndex
My answer is: `df.groupby('product').agg({'value':['min','max','mean']})`
# Exercise 6 Unstack
The goal of this exercise is to learn to unstack a MultiIndex.
Let's assume we trained a machine learning model that predicts a daily score for the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, etc.
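A minimal sketch with an invented (date, ticker) MultiIndex of scores:
```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [pd.bdate_range('2021-01-01', periods=3), ['AAPL', 'MSFT']],
    names=['date', 'ticker'],
)
scores = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], index=idx, name='score')
print(scores.unstack())  # one row per date, one column per ticker
```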

20
one_md_per_day_format/piscine/Week1/day5.md

@@ -31,9 +31,9 @@ https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
# Exercise 1
The goal of this exercise is to learn to manipulate time series in Pandas.
1. Create a `Series` named `integer_series` from 1st January 2010 to 31 December 2020. Each date is associated with the number of days since 1st January 2010. It starts with 0.
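A minimal sketch of such a series:
```python
import pandas as pd

dates = pd.date_range('2010-01-01', '2020-12-31', freq='D')
integer_series = pd.Series(range(len(dates)), index=dates)  # 0 on 2010-01-01
print(integer_series.head())
```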
@@ -79,9 +79,9 @@ The goal of this exercise is to learn to manipulate time series in Pandas.
```
If the `NaN` values have been dropped the solution is also accepted. The solution uses `rolling().mean()`.
# Exercise 2
The goal of this exercise is to learn to use Pandas on time series and on financial data.
The data we will use is Apple stock.
@@ -144,11 +144,11 @@ To get this result there are two ways: `resample` and `groupby`. There are two k
Name: Open, Length: 10118, dtype: float64
```
- The first way to compute the return without a for loop is to use `pct_change`
- The second way to compute the return without a for loop is to implement the formula given in the exercise in a vectorized way. To get the value at `t-1` you can use `shift`
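A minimal sketch of both ways, with a stand-in price series:
```python
import pandas as pd

open_price = pd.Series([10.0, 10.5, 10.2, 10.8])
returns_a = open_price.pct_change()
returns_b = open_price / open_price.shift(1) - 1  # same values, explicit formula
print(returns_b)
```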
# Exercise 3 Multi asset returns
The goal of this exercise is to learn to compute daily returns on a DataFrame that contains many assets (multi-assets).
```
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
@@ -187,9 +187,9 @@ Note: The data is generated randomly, the values you may have a different result
The DataFrame contains random data. Make sure your output and the one returned by this code is based on the same DataFrame.
# Exercise 4 Backtest
The goal of this exercise is to learn to perform a backtest in Pandas. A backtest is a tool that allows you to know how a strategy would have performed retrospectively using historical data. In this exercise we will focus on the backtesting tool and not on how to build the best strategy.
We will backtest a **long only** strategy on Apple Inc. Long only means that we only consider buying the stock. The input signal at date d says if the close price will increase at d+1. We assume that the input signal is available before the market closes.
@@ -266,7 +266,7 @@ My results can be reproduced using: `np.random.seed = 2712`. Given the versions
Name: Daily_futur_returns, Length: 10118, dtype: float64
```
The answer is also accepted if the returns are computed as in exercise 2 and then shifted into the future using `shift`, but I do not recommend this implementation as it adds missing values!
An example of a solution is:

32
one_md_per_day_format/piscine/Week2/day03.md

@@ -35,9 +35,9 @@ This object takes as input the preprocessing transforms and a Machine Learning m
## Resources
TODO
# Exercise 1 Imputer 1
The goal of this exercise is to learn how to use an Imputer to fill missing values on a basic example.
```
train_data = [[7, 6, 5],
@@ -84,11 +84,11 @@ test_data = [[np.nan, 1, 2],
[ 4., 2., 4.]])
```
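A minimal `SimpleImputer` sketch; only the first `train_data` row above survives this diff, the rest is invented:
```python
import numpy as np
from sklearn.impute import SimpleImputer

train_data = [[7, 6, 5], [4, np.nan, 5], [1, 20, 8]]
imputer = SimpleImputer(strategy='mean')
imputer.fit(train_data)                     # learn the column means on train data
print(imputer.transform([[np.nan, 1, 2]]))  # NaN replaced by the first column's mean
```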
# Exercise 2 Scaler
The goal of this exercise is to learn to scale a data set. There are various scaling techniques; we will focus on `StandardScaler` from scikit-learn.
We will use a tiny data set for this exercise that we will generate ourselves:
```
X_train = np.array([[ 1., -1., 2.],
@@ -140,8 +140,8 @@ array([[ 1.22474487, -1.22474487, 0.53452248],
[ 0. , 1.22474487, 0.53452248]])
```
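A minimal `StandardScaler` sketch; only the first `X_train` row above survives this diff, the rest is invented:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
scaler = StandardScaler().fit(X_train)  # learn per-column mean and std
print(scaler.transform(X_train))        # each column now has mean 0 and std 1
```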
# Exercise 3 One hot Encoder
The goal of this exercise is to learn how to deal with categorical variables using the OneHot Encoder.
```
X_train = [['Python'], ['Java'], ['Java'], ['C++']]
@@ -199,8 +199,8 @@ https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEn
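A minimal `OneHotEncoder` sketch on the `X_train` defined above:
```python
from sklearn.preprocessing import OneHotEncoder

X_train = [['Python'], ['Java'], ['Java'], ['C++']]
# On scikit-learn < 1.2, use sparse=False instead of sparse_output=False
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(X_train))  # one 0/1 column per category
print(encoder.categories_)
```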
# Exercise 4 Ordinal Encoder
The goal of this exercise is to learn how to deal with categorical variables using the Ordinal Encoder.
In that case, we want the model to consider that: **good > neutral > bad**
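A minimal sketch where the explicit category order encodes bad=0, neutral=1, good=2:
```python
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['bad', 'neutral', 'good']])
print(encoder.fit_transform([['good'], ['bad'], ['neutral']]))  # [[2.], [0.], [1.]]
```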
@@ -242,9 +242,9 @@ array([[2.],
# Exercise 5 Categorical variables
The goal of this exercise is to learn how to deal with categorical variables with the Ordinal Encoder, Label Encoder and OneHot Encoder.
Preliminary:
- Load the breast-cancer.csv file
@@ -359,7 +359,7 @@ AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provid
```
**It means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the column names in the right order. This step is not required in this exercise**
@@ -438,9 +438,9 @@ array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 2., 2., 0.,
```
# Exercise 6 Pipeline
The goal of this exercise is to learn to use the Scikit-learn object: Pipeline. The data set used for this exercise is the `iris` data set.
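A minimal sketch chaining preprocessing steps and a model on `iris`; the exact steps required by the exercise may differ:
```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print(pipe.score(X, y))
```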
Preliminary:
- Run the code below.
@@ -513,9 +513,9 @@ On financial data set, the ratio signal to noise is low. Trying to forecast stoc
# Exercise 1 Imputer 2
The goal of this exercise is to learn how to use an Imputer to fill missing values in the data set.
**Reminder**: The data exploration should be done first. It tells which rows/variables should be removed because there are too many missing values. Then the remaining data points can be treated using an Imputer.

22
one_md_per_day_format/piscine/Week2/day05.md

@@ -6,12 +6,12 @@
# Introduction
If you finished yesterday's exercises you should be able to train several Machine Learning algorithms and to choose the one returned by GridSearchCV.
GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the **cv** parameter to compute the GridSearch with a train set and a test set.
It means that the selected model is based on one single measure. What if, by luck, we predict correctly on that section? What if the best model is bad? What if I could have selected a better model?
We will answer these questions today! The topics we will cover are among the most important in Machine Learning.
Must read before starting the exercises:
- Bias-Variance trade-off, aka Underfitting/Overfitting.
- https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
@@ -28,9 +28,9 @@ Must read before starting the exercises:
## Resources
# Exercise 1: K-Fold
The goal of this exercise is to learn to use `KFold` to split the data set in a k-fold cross validation. Most of the time you won't use this function to split your data, because it is used by others such as `cross_val_score`, `cross_validate` or `GridSearchCV`. But it allows you to understand the splitting and to create a custom one if needed.
```
X = np.array(np.arange(1,21).reshape(10,-1))
@@ -81,9 +81,9 @@ y = np.array(np.arange(1,11))
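A minimal `KFold` sketch on the X and y defined above:
```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1, 21).reshape(10, -1)
y = np.arange(1, 11)
for train_index, test_index in KFold(n_splits=5).split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
```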
# Exercise 2: Cross validation (k-fold)
The goal of this exercise is to learn how to use cross validation. After reading the articles you should be able to explain why we need to cross-validate the models. We will first focus on Linear Regression to reduce the computation time. We will be using `cross_validate` to run the cross validation. Note that `cross_val_score` is similar, but `cross_validate` calculates one or more scores and timings for each CV split.
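A minimal `cross_validate` sketch; the exercise's real data set is elided here, so generated data stands in:
```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
cv_results = cross_validate(LinearRegression(), X, y, cv=5, scoring='r2')
print(cv_results['test_score'])  # one R2 score per fold, plus fit/score timings
```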
Preliminary:
@@ -159,9 +159,9 @@ The model is consistent across folds: it is stable. That's a first sign that the
# Exercise 3 GridsearchCV
The goal of this exercise is to learn to use GridSearchCV to run a grid search, predict on the test set and score on the test set.
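A minimal sketch of that workflow, with an invented model and grid:
```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)  # grid search on the train set
print(grid.best_params_, grid.score(X_test, y_test))  # score on the test set
```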
Preliminary:
@@ -250,13 +250,13 @@ WARNING: If the score used in classification is the AUC, there is one rare case
# Exercise 5 Validation curve and Learning curve
The goal of this exercise is to learn to analyse the model's performance with two tools:
- Validation curve
- Learning curve
For this exercise we will use a dataset of 100k data points to give you an idea of the computation time you can expect during projects.
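A minimal `learning_curve` sketch; the 100k-point data set is elided here, so a small generated one stands in:
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y, cv=5,
    train_sizes=[0.1, 0.325, 0.55, 0.775, 1.0],
)
print(sizes, test_scores.mean(axis=1))  # test score as the training size grows
```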
Preliminary:

22
one_md_per_day_format/piscine/Week2/day1.md

@@ -51,9 +51,9 @@ https://scikit-learn.org/stable/tutorial/index.html
- https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture?hl=en
# Exercise 1 Scikit-learn estimator
The goal of this exercise is to learn to fit a Scikit-learn estimator and use it to predict.
```
@@ -92,9 +92,9 @@ X, y = [[1],[2.1],[3]], [[1],[2],[3]]
```
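A minimal fit/predict sketch on the exercise's toy data:
```python
from sklearn.linear_model import LinearRegression

X, y = [[1], [2.1], [3]], [[1], [2], [3]]
model = LinearRegression()
model.fit(X, y)  # learn the coefficients
print(model.predict([[4]]))
```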
# Exercise 2 Linear regression in 1D
The goal of this exercise is to understand how linear regression works in one dimension. To do so, we will generate data in one dimension. Using `make_regression` from Scikit-learn, generate a data set with 100 observations:
```
X, y, coef = make_regression(n_samples=100,
@@ -162,9 +162,9 @@ array([ 83.86186727, 140.80961751, 116.3333897 , 64.52998689,
6. This question is validated if the MSE returned is `2854.2871542048706`
# Exercise 3: Train test split
The goal of this exercise is to learn to split a data set. It is important to understand why we split the data in two sets. To put it in a nutshell: the Machine Learning algorithm learns on the training data and is evaluated on data that it hasn't seen before: the testing data.
This video gives a basic and nice explanation: https://www.youtube.com/watch?v=_vdMKioCXqQ
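A minimal `train_test_split` sketch; with `shuffle=False` the last 20% of rows become the test set:
```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(20).reshape(10, 2), np.arange(1, 11)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(y_test)  # [ 9 10]
```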
@@ -208,10 +208,10 @@ y_test:
[ 9 10]
```
# Exercise 4 Forecast diabetes progression
The goal of this exercise is to use Linear Regression to forecast the progression of diabetes. It will not always be specified; you should **ALWAYS** start by doing an exploratory data analysis in order to have a good understanding of the data you model. As a reminder, here is an introduction to EDA:
https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9
The data set used is described in https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.
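A minimal sketch of the expected workflow; the `random_state` of the split is an assumption (the exercise's split parameters are not shown in this view):

```
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)
reg = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_train, reg.predict(X_train)))   # MSE on the train set
print(mean_squared_error(y_test, reg.predict(X_test)))     # MSE on the test set
```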
@ -300,11 +300,11 @@ https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
4. This question is validated if the mse on the **train set** is `2888.326888` and the mse on the **test set** is `2858.255153`.
## Exercice 5 Gradient Descent
## Exercise 5 Gradient Descent
The goal of this exercice is to understand how the Linear Regression algorithm finds the optimal coefficients.
The goal of this exercise is to understand how the Linear Regression algorithm finds the optimal coefficients.
The goal is to fit a Linear Regression on a one dimensional features data **without using Scikit-learn**. Let's use the data set we generated for the exercice 1:
The goal is to fit a Linear Regression on one-dimensional feature data **without using Scikit-learn**. Let's use the data set we generated for exercise 1:
```
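The code block above is truncated in this view. For reference, a minimal gradient descent sketch on hypothetical 1D data, minimizing the MSE of `y_pred = a*X + b` by following the analytical gradients:

```
import numpy as np

# hypothetical 1D data; the exercise reuses the set generated in exercise 1
X = np.array([1.0, 2.1, 3.0])
y = np.array([1.0, 2.0, 3.0])

a, b = 0.0, 0.0          # coefficients of y_pred = a*X + b
lr = 0.05                # learning rate
for _ in range(1000):
    y_pred = a * X + b
    grad_a = -2 * np.mean(X * (y - y_pred))   # dMSE/da
    grad_b = -2 * np.mean(y - y_pred)         # dMSE/db
    a -= lr * grad_a
    b -= lr * grad_b
print(a, b)              # coefficients after convergence
```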

28
one_md_per_day_format/piscine/Week2/day2.md

@ -31,8 +31,8 @@ More details:
https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102
For the linear regression exercices, the loss (Mean Square Error - MSE) is minimized with an algorithm called **gradient descent**. In the classification, the loss MSE can't be used because the output of the model is 0 or 1 (for binary classfication).
The **logloss** or **cross entropy** is the loss used for classification. Similarly, it has some nice mathematical properties. The minimization of the **logloss** is not covered in the exercices. However, since it is used in most machine learning models for classification, I recommand to spend some time reading the related article. This article gives a nice example of how it works:
For the linear regression exercises, the loss (Mean Square Error - MSE) is minimized with an algorithm called **gradient descent**. In classification, the MSE loss can't be used because the output of the model is 0 or 1 (for binary classification).
The **logloss** or **cross entropy** is the loss used for classification. Similarly, it has some nice mathematical properties. The minimization of the **logloss** is not covered in the exercises. However, since it is used in most machine learning models for classification, I recommend spending some time reading the related article. This article gives a nice example of how it works:
https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
@ -48,9 +48,9 @@ https://medium.com/swlh/what-is-logistic-regression-62807de62efa
# Exercice 1 Logistic regression in Scikit-learn
# Exercise 1 Logistic regression in Scikit-learn
The goal of this exercice is to learn to use Scikit-learn to classify data.
The goal of this exercise is to learn to use Scikit-learn to classify data.
```
X = [[0],[0.1],[0.2], [1],[1.1],[1.2], [1.3]]
y = [0,0,0,1,1,1,0]
@ -93,9 +93,9 @@ Score:
```
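A minimal sketch reusing the `X` and `y` above, fitting scikit-learn's `LogisticRegression`:

```
from sklearn.linear_model import LogisticRegression

X = [[0], [0.1], [0.2], [1], [1.1], [1.2], [1.3]]
y = [0, 0, 0, 1, 1, 1, 0]
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))     # predicted classes
print(clf.score(X, y))    # accuracy on the training data
```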
# Exercice 2 Sigmoid
# Exercise 2 Sigmoid
The goal of this exercice is to learn to compute and plot the sigmoid function.
The goal of this exercise is to learn to compute and plot the sigmoid function.
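A minimal sketch of the base sigmoid, assuming numpy and matplotlib (the custom variants in the task below shift and scale `x` before applying it):

```
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 200)
sigmoid = 1 / (1 + np.exp(-x))   # squashes any real input into (0, 1)
plt.plot(x, sigmoid)
plt.show()
```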
1. On the same plot, plot the sigmoid function and the custom sigmoids defined as:
```
@ -121,9 +121,9 @@ The plot should look like this:
# Exercice 3 Decision boundary
# Exercise 3 Decision boundary
The goal of this exercice is to learn to fit a logistic regression on simple examples and to understand how the algorithm separated the data from the different classes.
The goal of this exercise is to learn to fit a logistic regression on simple examples and to understand how the algorithm separates the data of the different classes.
## 1 dimension
@ -304,9 +304,9 @@ As mentioned, it is not required to shift the class prediction to make the plot
# Exercice 4: Train test split
# Exercise 4: Train test split
The goal of this exercice is to learn to split a classification data set. The idea is the same as splitting a regression data set but there's one important detail specific to the classification: the proportion of each class in the train set and test set.
The goal of this exercise is to learn to split a classification data set. The idea is the same as splitting a regression data set, but there's one important detail specific to classification: the proportion of each class in the train set and test set.
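A minimal sketch of a stratified split; the toy class sizes are assumptions chosen so the proportions come out exactly, and the key is the `stratify` parameter of `train_test_split`:

```
from sklearn.model_selection import train_test_split

X = [[i] for i in range(20)]
y = [0] * 14 + [1] * 6                     # 30% of the points belong to class 1
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=0)
print(sum(y_tr) / len(y_tr), sum(y_te) / len(y_te))   # class 1 proportion: 0.3 and 0.3
```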
@ -358,9 +358,9 @@ The proportion of class `1` is **0.125** in the train set and **1.** in the test
2. This question is validated if the proportion of class `1` is **0.3** for both sets.
# Exercice 5 Breast Cancer prediction
# Exercise 5 Breast Cancer prediction
The goal of this exercice is to use Logistic Regression
The goal of this exercise is to use Logistic Regression
to predict breast cancer. It is always important to understand the data before training any Machine Learning algorithm. The data is described in **breast-cancer-wisconsin.names**. I suggest adding the column names to the DataFrame manually.
Preliminary:
@ -439,9 +439,9 @@ array([[90, 2],
As said, for various reasons you may get slightly different results because of the data splitting. However, the values you have in the confusion matrix should be close to these results.
# Exercice 6 Multi-class (Optional)
# Exercise 6 Multi-class (Optional)
The goal of this exercice is to learn to train a classfication algorithm on a multi-class labelled data.
The goal of this exercise is to learn to train a classification algorithm on multi-class labelled data.
Some algorithms such as SVM or Logistic Regression do not natively support multi-class (more than 2 classes). There are some approaches that allow using these algorithms on multi-class data.
Let's assume we work with 3 classes: A, B and C.
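A minimal sketch of the one-vs-rest approach on the 3-class Iris data (standing in for A, B and C; the data set is an assumption here): one binary classifier is trained per class.

```
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)          # 3 classes, standing in for A, B and C
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))                # one binary classifier per class: 3
```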

28
one_md_per_day_format/piscine/Week2/day4.md

@ -36,9 +36,9 @@ https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-model
https://scikit-learn.org/stable/modules/model_evaluation.html
# Exercice 1 MSE Scikit-learn
# Exercise 1 MSE Scikit-learn
The goal of this exercice is to learn to use `sklearn.metrics` to compute the mean squared error (MSE).
The goal of this exercise is to learn to use `sklearn.metrics` to compute the mean squared error (MSE).
1. Compute the MSE using `sklearn.metrics` on `y_true` and `y_pred` below:
@ -51,10 +51,10 @@ y_pred = [90, 48, 2, 2, -4]
1. This question is validated if the MSE outputted is **2.25**.
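A minimal sketch; `y_true` here is an assumption chosen to reproduce the stated 2.25 (only `y_pred` is visible in this view):

```
from sklearn.metrics import mean_squared_error

y_true = [91, 51, 2.5, 2, -5]   # hypothetical ground truth
y_pred = [90, 48, 2, 2, -4]
print(mean_squared_error(y_true, y_pred))   # 2.25
```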
# Exercice 2 Accuracy Scikit-learn
# Exercise 2 Accuracy Scikit-learn
The goal of this exercice is to learn to use `sklearn.metrics` to compute the accuracy.
The goal of this exercise is to learn to use `sklearn.metrics` to compute the accuracy.
1. Compute the accuracy using `sklearn.metrics` on `y_true` and `y_pred` below:
@ -68,9 +68,9 @@ y_true = [0, 0, 1, 1, 1, 1, 0]
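A minimal sketch with a hypothetical `y_pred` (the exercise's predictions are truncated in this view):

```
from sklearn.metrics import accuracy_score

y_true = [0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 1]   # hypothetical predictions
print(accuracy_score(y_true, y_pred))   # 6 correct out of 7
```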
# Exercice 3 Regression
# Exercise 3 Regression
The goal of this exercice is to learn to evaluate a machine learning model using many regression metrics.
The goal of this exercise is to learn to evaluate a machine learning model using many regression metrics.
Preliminary:
@ -138,13 +138,13 @@ pipe.fit(X_train, y_train)
MSE on the test set: 0.5537420654727396
```
This result shows that the model has slightly better results on the train set than the test set. That's frequent since it is easier to get a better grade on an exam we studied than an exam that is different from what was prepared. However, the results are not good: r2 ~ 0.3. Fitting non linear models as the Random Forest on this data may improve the results. That's the goal of the exercice 5.
This result shows that the model has slightly better results on the train set than on the test set. That's common: it is easier to score well on an exam we prepared for than on one that differs from what we studied. However, the results are not good: r2 ~ 0.3. Fitting non-linear models such as the Random Forest on this data may improve the results. That's the goal of exercise 5.
# Exercice 4 Classification
# Exercise 4 Classification
The goal of this exercice is to learn to evaluate a machine learning model using many classification metrics.
The goal of this exercise is to learn to evaluate a machine learning model using many classification metrics.
Preliminary:
@ -232,9 +232,9 @@ Having a 99% ROC AUC is not usual. The data set we used is easy to classify. On
# Exercice 5 Machine Learning models
# Exercise 5 Machine Learning models
The goal of this exercice is to have an overview of the existing Machine Learning models and to learn to call them from scikit learn.
The goal of this exercise is to have an overview of the existing Machine Learning models and to learn to call them from scikit-learn.
We will focus on:
- SVM/ SVC
@ -363,9 +363,9 @@ Take time to have basic understanding of the role of the basic hyperparameters a
It is important to notice that the Decision Tree overfits very easily. It learns the training data easily but is not able to generalize to the test set. This algorithm is not used a lot.
However, Random Forest and Gradient Boosting propose a solid approach to correct the overfitting (in that case the parameter `max_depth` was set to `None`, which is why the Random Forest overfits the data). These two algorithms are used intensively in Machine Learning projects.
# Exercice 6 Grid Search
# Exercise 6 Grid Search
The goal of this exercice is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameters which are the paremeters of the model impact the performance of the model.
The goal of this exercise is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameters, which are the parameters of the model, impact the performance of the model.
The scikit-learn object that runs the Grid Search is called GridSearchCV. We will learn about cross-validation tomorrow. For now, let us set the parameter **cv** to `[(np.arange(18576), np.arange(18576,20640))]`.
This means that GridSearchCV splits the data set into a train and a test set.
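A minimal sketch of that setup, assuming the California housing data (its 20640 rows match the indices above) and a hypothetical parameter grid:

```
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = fetch_california_housing(return_X_y=True)   # 20640 rows
gs = GridSearchCV(RandomForestRegressor(random_state=0),
                  param_grid={'n_estimators': [10, 50]},   # hypothetical grid
                  cv=[(np.arange(18576), np.arange(18576, 20640))])
gs.fit(X, y)   # trains on the first 18576 rows, evaluates on the last 2064
print(gs.best_params_, gs.best_score_)
```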
@ -450,7 +450,7 @@ Ressources:
return gs.best_estimator_, gs.best_params_, gs.best_score_
```
In my case, the gridsearch parameters are not interesting. Even if I reduced the overfitting of the Random Forest, the score on the test is lower than the score on the test returned by the Gradient Boosting in the previous exercice without optimal parameters search.
In my case, the grid search parameters are not interesting. Even though I reduced the overfitting of the Random Forest, the score on the test set is lower than the score returned by the Gradient Boosting in the previous exercise without an optimal parameter search.
3. This question is validated if the code used is:

10
one_md_per_day_format/piscine/Week2/template.md

@ -16,21 +16,21 @@
## Resources
# Exercice 1
# Exercise 1
# Exercice 2
# Exercise 2
# Exercice 3
# Exercise 3
# Exercice 4
# Exercise 4
# Exercice 5
# Exercise 5

10
one_md_per_day_format/piscine/Week3/template.md

@ -16,21 +16,21 @@
## Resources
# Exercice 1
# Exercise 1
# Exercice 2
# Exercise 2
# Exercice 3
# Exercise 3
# Exercice 4
# Exercise 4
# Exercice 5
# Exercise 5

18
one_md_per_day_format/piscine/Week3/w3day02.md

@ -10,7 +10,7 @@ The goal of this day is to learn to use Keras to build Neural Networks.
There are two ways to build Keras models: sequential and functional.
The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercices focuses on the usage of the sequential API.
The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercises focus on the usage of the sequential API.
'2.4.3'
@ -25,9 +25,9 @@ A developper
## Resources
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
# Exercice 1 Sequential
# Exercise 1 Sequential
The goal of this exercice is to learn to call the object `Sequential`.
The goal of this exercise is to learn to call the object `Sequential`.
1. Put the object Sequential in a variable named `model` and print the variable `model`.
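A minimal sketch, assuming the `tensorflow.keras` import path:

```
from tensorflow.keras.models import Sequential

model = Sequential()
print(model)   # prints the Sequential object and its memory address
```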
@ -39,9 +39,9 @@ The goal of this exercice is to learn to call the object `Sequential`.
# Exercice 2 Dense
# Exercise 2 Dense
The goal of this exercice is to learn to create layers of neurons. Keras proposes options to create custom layers. The neural networks build in these exercices do not require custom layers. `Dense` layers do the job. A dense layer is simply a layer where each unit or neuron is connected to each neuron in the next layer. As seen yesterday, there are three main types of layers: input, hidden and output. The **input layer** that specifies the number of inputs (features) is not represented as a layer in Keras. However, `Dense` has a parameter `input_dim` that gives the number of inputs in the previous layer. The output layer as any hidden layer can be created using `Dense`, the only difference is that the output layer contains one single neuron.
The goal of this exercise is to learn to create layers of neurons. Keras proposes options to create custom layers. The neural networks built in these exercises do not require custom layers; `Dense` layers do the job. A dense layer is simply a layer where each unit or neuron is connected to each neuron in the next layer. As seen yesterday, there are three main types of layers: input, hidden and output. The **input layer**, which specifies the number of inputs (features), is not represented as a layer in Keras. However, `Dense` has a parameter `input_dim` that gives the number of inputs in the previous layer. The output layer, like any hidden layer, can be created using `Dense`; the only difference is that the output layer contains a single neuron.
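A minimal sketch with hypothetical parameters (the exercise's exact ones are truncated in this view):

```
from tensorflow.keras.layers import Dense

# hypothetical layer: 8 neurons, sigmoid activation, 4 inputs from the previous layer
layer = Dense(units=8, activation='sigmoid', input_dim=4)
print(layer.get_config())   # dict of the layer's configuration
```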
1. Create a `Dense` layer with these parameters and return the output of `get_config`:
@ -121,9 +121,9 @@ The goal of this exercice is to learn to create layers of neurons. Keras propose
'bias_constraint': None}
```
# Exercice 3 Architecture
# Exercise 3 Architecture
The goal of this exercice is to combine the layers and to create a neural network.
The goal of this exercise is to combine the layers and to create a neural network.
1. Create a neural network for regression with the following architecture and return `print(model.summary())`:
@ -145,9 +145,9 @@ The goal of this exercice is to combine the layers and to create a neural networ
```
The first two layers could use another activation function than sigmoid (e.g. relu).
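For reference, a minimal sketch of such a stack; the layer sizes are assumptions:

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(8, input_dim=5, activation='sigmoid'))  # hidden layer 1; sizes are assumptions
model.add(Dense(4, activation='sigmoid'))               # hidden layer 2
model.add(Dense(1, activation='linear'))                # single output neuron for regression
model.summary()
```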
# Exercice 4 Optimize
# Exercise 4 Optimize
The goal of this exercice is to learn to train the neural network. Once the architecture of the neural network is set there are two steps to train the neural network:
The goal of this exercise is to learn to train the neural network. Once the architecture of the neural network is set, there are two steps to train it:
- `compile`: The compilation step aims to set the loss function, to choose the algorithm to minimize the chosen loss function and to choose the metric the model outputs.
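A minimal sketch of the two steps on toy data; shapes and hyperparameters are assumptions:

```
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X_train = np.random.rand(100, 5)   # toy data, an assumption
y_train = np.random.rand(100)
model = Sequential([Dense(8, input_dim=5, activation='relu'), Dense(1)])
model.compile(loss='mse', optimizer='adam', metrics=['mae'])  # set loss, optimizer, metric
model.fit(X_train, y_train, epochs=10, batch_size=32)         # minimize the loss on the data
```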

22
one_md_per_day_format/piscine/Week3/w3day03.md

@ -24,9 +24,9 @@ A developper
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
# Exercice 1 Regression - Optimize
# Exercise 1 Regression - Optimize
The goal of this exercice is to learn to set up the optimization for a regression neural network. There's no code to run in that exercice. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network:
The goal of this exercise is to learn to set up the optimization for a regression neural network. There's no code to run in that exercise. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network:
```
model = keras.Sequential()
@ -68,9 +68,9 @@ https://keras.io/api/losses/regression_losses/
https://keras.io/api/metrics/regression_metrics/
# Exercice 2 Regression example
# Exercise 2 Regression example
The goal of this exercice is to learn to train a neural network to perform a regression on a data set.
The goal of this exercise is to learn to train a neural network to perform a regression on a data set.
The data set is the Auto MPG Dataset and the goal is to build a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.
https://www.tensorflow.org/tutorials/keras/regression
@ -150,9 +150,9 @@ The output neuron has to be `Dense(1)` - by defaut the activation funtion is lin
*Hint*: To get the score on the test set, `evaluate` could have been used: `model.evaluate(X_test_scaled, y_test)`.
# Exercice 3 Multi classification - Softmax
# Exercise 3 Multi classification - Softmax
The goal of this exercice is to learn to a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the three class values require specialized handling. A multi-classification neural network uses as output layer a **softmax** layer. The **softmax** activation function is an extension of the sigmoid as it is designed to output the probabilities to belong to each class in a multi-class problem. This output layer has to contain as much neurons as classes in the multi-classification problem. This article explains in detail how it works. https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
The goal of this exercise is to learn to design a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the three class values require specialized handling. A multi-classification neural network uses a **softmax** layer as output layer. The **softmax** activation function is an extension of the sigmoid as it is designed to output the probabilities to belong to each class in a multi-class problem. This output layer has to contain as many neurons as there are classes in the multi-classification problem. This article explains in detail how it works. https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
Let us assume we want to classify images and we know they contain either apples, bears, candies, eggs or dogs (extension of the example in the link above).
@ -175,9 +175,9 @@ Let us assume we want to classify images and we know they contain either apples,
model.add(Dense(5, activation= 'softmax'))
```
# Exercice 4 Multi classification - Optimize
# Exercise 4 Multi classification - Optimize
The goal of this exercice is to learn to optimize a multi-classification neural network. As learnt previously, the loss function used in binary classification is the log loss - also called in Keras `binary_crossentropy`. This function is defined for binary classification and can be extended to multi-classfication. In Keras, the extended loss that supports multi-classification is `binary_crossentropy`. There's no code to run in that exercice.
The goal of this exercise is to learn to optimize a multi-classification neural network. As learnt previously, the loss function used in binary classification is the log loss - also called in Keras `binary_crossentropy`. This function is defined for binary classification and can be extended to multi-classification. In Keras, the extended loss that supports multi-classification is `categorical_crossentropy`. There's no code to run in that exercise.
1. Fill the chunk of code below in order to optimize the neural network defined in the previous exercise. Choose the adapted loss, adam as optimizer and the accuracy as metric.
@ -196,9 +196,9 @@ model.compile(loss='categorical_crossentropy',
metrics=['accuracy'])
```
# Exercice 5 Multi classification example
# Exercise 5 Multi classification example
The goal of this exercice is to learn to use a neural network to classify a multiclass data set. The data set used is the Iris data set which allows to classify flower given basic features as flower's measurement.
The goal of this exercise is to learn to use a neural network to classify a multiclass data set. The data set used is the Iris data set, which allows classifying flowers given basic features such as the flower's measurements.
Preliminary:
- Split train test. Keep 20% for the test set. Use `random_state=1`.
@ -245,6 +245,6 @@ model.fit(X_train_sc, y_train_multi_class, epochs = 1000, batch_size=20)
# Exercice 6 GridSearch
# Exercise 6 GridSearch
https://medium.com/@am.benatmane/keras-hyperparameter-tuning-using-sklearn-pipelines-grid-search-with-cross-validation-ccfc74b0ce9f

40
one_md_per_day_format/piscine/Week3/w3day04.md

@ -29,12 +29,12 @@ Les packages NLTK and Spacy to do the preprocessing
## Resources
# Exercice 1: Lowercase
# Exercise 1: Lowercase
The goal of this exercice is to learn to lowercase text data in Python. Note that if the volume of data is low the text data can be stored in a Pandas DataFrame or Series. But, when dealing with high volumes (high but not huge), using a Pandas DataFrame or Series is not efficient. Data structures as dictionaries or list are more adapted.
The goal of this exercise is to learn to lowercase text data in Python. Note that if the volume of data is low the text data can be stored in a Pandas DataFrame or Series. But, when dealing with high volumes (high but not huge), using a Pandas DataFrame or Series is not efficient. Data structures such as dictionaries or lists are better adapted.
```
list_ = ["This is my first NLP exercice", "wtf!!!!!"]
list_ = ["This is my first NLP exercise", "wtf!!!!!"]
series_data = pd.Series(list_, name='text')
```
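A minimal sketch using the pandas string accessor on the series above:

```
import pandas as pd

series_data = pd.Series(["This is my first NLP exercise", "wtf!!!!!"], name='text')
print(series_data.str.lower())   # lowercase every element
print(series_data.str.upper())   # uppercase every element
```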
@ -46,21 +46,21 @@ Note: Do not change the text manually !
1. This question is validated if the output is:
```
0 this is my first nlp exercice
0 this is my first nlp exercise
1 wtf!!!!!
Name: text, dtype: object
```
2. This question is validated if the output is:
```
0 THIS IS MY FIRST NLP EXERCICE
0 THIS IS MY FIRST NLP EXERCISE
1 WTF!!!!!
Name: text, dtype: object
```
# Exercise 2: Punctuation
The goal of this exerice is to learn to deal with punctuation. In Natural Language Processing, some basic approaches as Bag of Words (exercice X) model the text as an unordered combination of words. In that case the punctuation is not always useful as it doesn't add information to the model. That is why is removed.
The goal of this exercise is to learn to deal with punctuation. In Natural Language Processing, some basic approaches such as Bag of Words (exercise X) model the text as an unordered combination of words. In that case the punctuation is not always useful as it doesn't add information to the model. That is why it is removed.
1. Remove the punctuation from this sentence. All characters in !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ are considered as punctuation.
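A minimal sketch on a hypothetical sentence; `string.punctuation` is exactly the character set listed above:

```
import string

text = "This is my first NLP exercise, wtf!!!!!"   # hypothetical sentence
# build a translation table that deletes every punctuation character
print(text.translate(str.maketrans('', '', string.punctuation)))
```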
@ -81,9 +81,9 @@ The goal of this exerice is to learn to deal with punctuation. In Natural Langua
```
# Exercice 3 Tokenization
# Exercise 3 Tokenization
The goal of this exercice is to learn to tokenize as text. This step is important because it splits the text into token. A token could be a sentence or a word.
The goal of this exercise is to learn to tokenize a text. This step is important because it splits the text into tokens. A token could be a sentence or a word.
```
text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software."""
@ -152,13 +152,13 @@ https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-p
```
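A minimal sketch reusing the `text` variable above; NLTK offers both sentence-level and word-level tokenizers:

```
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')        # tokenizer models, needed once
print(sent_tokenize(text))    # tokens are sentences
print(word_tokenize(text))    # tokens are words
```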
# Exercice 4 Stop words
# Exercise 4 Stop words
The goal of this exercice is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language. For example: "and", "is", "a" are stop words and do not add information to a sentence.
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refer to the most common words in a language. For example: "and", "is", "a" are stop words and do not add information to a sentence.
```
text = """
The goal of this exercice is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
"""
```
1. Remove stop words from this sentence and return the list of word tokens without stop words.
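A minimal sketch reusing the `text` variable above, assuming the NLTK `stopwords` corpus has been downloaded:

```
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)                         # `text` is the variable above
print([t for t in tokens if t not in stop_words])    # keep only non stop words
```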
@ -168,13 +168,13 @@ The goal of this exercice is to learn to remove stop words with NLTK. Stop word
1. This question is validated if, using NLTK, the output is:
```
['The', 'goal', 'exercice', 'learn', 'remove', 'stop', 'words', 'NLTK', '.', 'Stop', 'words', 'usually', 'refers', 'common', 'words', 'language', '.']
['The', 'goal', 'exercise', 'learn', 'remove', 'stop', 'words', 'NLTK', '.', 'Stop', 'words', 'usually', 'refers', 'common', 'words', 'language', '.']
```
# Exercice 5 Stemming
# Exercise 5 Stemming
The goal of this exercice is to learn to use stemming using NLTK. As explained in details in the article, stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language.
The goal of this exercise is to learn to use stemming with NLTK. As explained in detail in the article, stemming is the process of reducing inflection in words to their root forms, such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
Note: The output of a stemmer is a word that may not exist in the dictionary.
@ -196,9 +196,9 @@ The interviewer interviews the president in an interview
```
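A minimal sketch on the sentence above, using NLTK's `PorterStemmer` (the stems in the comment are indicative):

```
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

sentence = "The interviewer interviews the president in an interview"
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in word_tokenize(sentence)])
# e.g. ['the', 'interview', 'interview', 'the', 'presid', 'in', 'an', 'interview']
```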
# Exercice 6: Text preprocessing
# Exercise 6: Text preprocessing
The goal of this exercice is to learn to create a function to prepocess and clean a text using NLTK.
The goal of this exercise is to learn to create a function to preprocess and clean a text using NLTK.
Put this text in a variable:
@ -267,22 +267,22 @@ https://towardsdatascience.com/nlp-preprocessing-with-nltk-3c04ee00edc0
```
# Exercice 7: Bag of Word representation
# Exercise 7: Bag of Word representation
https://machinelearningmastery.com/gentle-introduction-bag-words-model/
The goal of this exercice is to understand how to create a Bag of Word (BoW) model on a corpus of texts. More precesily we will create a labeled data set from textual data using a word count matrix.
The goal of this exercise is to understand how to create a Bag of Words (BoW) model on a corpus of texts. More precisely we will create a labeled data set from textual data using a word count matrix.
As explained in the resource, the Bag of Words representation makes the assumption that the order in which the words appear in a text doesn't matter. There are different types of Bag of Words representations:
- Boolean: Each document is a boolean vector
- Wordcount: Each document is a word count vector
- TFIDF: Each document is a score vector. The score is detailed in the next exercice.
- TFIDF: Each document is a score vector. The score is detailed in the next exercise.
The data `tweets_train.txt` contains tweets labeled with a sentiment. It gives the positivity of a tweet.
Steps:
1. Preprocess the data using the function implemented in the previous exercice. And, using from `CountVectorizer` of scikitlearn with `max_features=500` compute the wordcount of the tweets. The output is a sparse matrix.
1. Preprocess the data using the function implemented in the previous exercise. Then, using `CountVectorizer` from scikit-learn with `max_features=500`, compute the word count of the tweets. The output is a sparse matrix.
- Check the shape of the word count matrix
- Set **max_features** to 500 of the initial size of the dictionary.
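A minimal sketch on hypothetical toy tweets (the real data comes from `tweets_train.txt`):

```
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love this movie", "I hate this movie", "great movie great cast"]  # toy tweets
vectorizer = CountVectorizer(max_features=500)
bow = vectorizer.fit_transform(corpus)   # sparse word count matrix
print(bow.shape)                         # (n_documents, n_words_kept)
print(sorted(vectorizer.vocabulary_))    # the words kept in the dictionary
```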

24
one_md_per_day_format/piscine/Week3/w3day05.md

@ -19,9 +19,9 @@ There are many type of language models pre-trained in Spacy. Each has its specif
## Resources
# Exercice 1 Embedding 1
# Exercise 1 Embedding 1
The goal of this exercice is to learn to load an embedding on SpaCy.
The goal of this exercise is to learn to load an embedding on SpaCy.
1. Install and load `en_core_web_sm` embedding. Compute the embedding of `car`.
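A minimal sketch, assuming the model has been downloaded first with `python -m spacy download en_core_web_sm`:

```
import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp('car').vector)         # the embedding, a numpy array
print(nlp('car').vector.shape)   # the embedding dimension
```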
@ -40,10 +40,10 @@ array([ 1.0522802e+00, 1.4806499e+00, 7.7402556e-01, 1.0373484e+00,
```
# Exercice 2: Tokenization
# Exercise 2: Tokenization
The goal of this exercice is to learn to tokenize a document using Spacy. We did this using NLTK yesterday.
The goal of this exercise is to learn to tokenize a document using Spacy. We did this using NLTK yesterday.
1. Tokenize the text below and print the tokens
@ -68,9 +68,9 @@ The goal of this exercice is to learn to tokenize a document using Spacy. We did
.
```
## Exercice 3 Embeddings 2
## Exercise 3 Embeddings 2
The goal of this exercice is to learn to use SpaCy embedding on a document.
The goal of this exercise is to learn to use SpaCy embedding on a document.
1. Compute the embedding of all the words in this sentence. The language model considered is `en_core_web_md`
@ -106,9 +106,9 @@ https://medium.com/datadriveninvestor/cosine-similarity-cosine-distance-6571387f
[logo]: w3day05ex1_plot.png "Plot"
# Exercice 4 Sentences' similarity
# Exercise 4 Sentences' similarity
The goal of this exerice is to learn to compute the similarity between two sentences. As explained in the documentation: **The word embedding of a full sentence is simply the average over all different words**. This is how `similarity` works in SpaCy. This small use case is very interesting because if we build a corpus of sentences that express an intention as **buy shoes**, then we can detect this intention and use it to propose shoes advertisement for customers. The language model used in this exercice is `en_core_web_sm`.
The goal of this exercise is to learn to compute the similarity between two sentences. As explained in the documentation: **The word embedding of a full sentence is simply the average over all different words**. This is how `similarity` works in SpaCy. This small use case is very interesting because if we build a corpus of sentences that express an intention such as **buy shoes**, then we can detect this intention and use it to propose shoe advertisements to customers. The language model used in this exercise is `en_core_web_sm`.
1. Compute the similarities (3 in total) between these sentences:
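A minimal sketch on two hypothetical sentences (the exercise's three sentences are truncated in this view):

```
import spacy

nlp = spacy.load('en_core_web_sm')
doc1 = nlp("I want to buy shoes")           # hypothetical sentences
doc2 = nlp("I'm looking for new sneakers")
print(doc1.similarity(doc2))   # cosine similarity of the averaged word vectors
```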
@ -135,9 +135,9 @@ The goal of this exerice is to learn to compute the similarity between two sente
# Exercice 5: NER
# Exercise 5: NER
The goal of this exercice is to learn to use a Named entity recognition algorithm to detect entities.
The goal of this exercise is to learn to use a Named entity recognition algorithm to detect entities.
```
Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. It is considered one of the Big Five companies in the U.S. information technology industry, along with Amazon, Google, Microsoft, and Facebook.
@ -189,9 +189,9 @@ https://en.wikipedia.org/wiki/Named-entity_recognition
```
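A minimal sketch on a shortened version of the text above:

```
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple Inc. is an American multinational technology company "
          "headquartered in Cupertino, California.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple Inc. ORG, Cupertino GPE, California GPE
```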
# Exercice 6 Part-of-speech tags
# Exercise 6 Part-of-speech tags
The goal od this exercice is to learn to use the Part-of-speech tags (**POS TAG**) using Spacy. As explained in wikipedia, the POS TAG is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
The goal of this exercise is to learn to use Part-of-speech tags (**POS TAG**) using Spacy. As explained in Wikipedia, POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
Example
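The original example is truncated in this view; a minimal sketch on a hypothetical sentence:

```
import spacy

nlp = spacy.load('en_core_web_sm')
for token in nlp("Paul drives his car"):
    print(token.text, token.pos_)   # e.g. Paul PROPN, drives VERB, his PRON, car NOUN
```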

20
one_md_per_day_format/piscine/Week3/w3day1.md

@ -23,9 +23,9 @@ https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-funct
Reproduce this article without backprop
https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9
# Exercice 1 The neuron
# Exercise 1 The neuron
The goal of this exercice is to understand the role of a neuron and to implement a neuron.
The goal of this exercise is to understand the role of a neuron and to implement a neuron.
An artificial neuron, the basic unit of the neural network, (also referred to as a perceptron) is a mathematical function. It takes one or more inputs that are multiplied by values called “weights” and added together. This value is then passed to a non-linear function, known as an activation function, to become the neuron’s output.
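A minimal sketch of such a neuron, with hypothetical weights and a sigmoid activation:

```
import numpy as np

class Neuron:
    """Weighted sum of the inputs plus a bias, passed through a sigmoid activation."""
    def __init__(self, weights, bias):
        self.weights = np.array(weights)
        self.bias = bias

    def feedforward(self, inputs):
        total = np.dot(self.weights, inputs) + self.bias
        return 1 / (1 + np.exp(-total))       # sigmoid activation

neuron = Neuron(weights=[0.5, 0.5], bias=0.0)  # hypothetical weights
print(neuron.feedforward(np.array([2.0, 3.0])))
```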
@ -91,7 +91,7 @@ https://victorzhou.com/blog/intro-to-neural-networks/
# Exerice 2 Neural network
The goal of this exercice is to understand how to combine three neurons to form a neural network. A neural newtwork is nothing else than neurons connected together. As shown in the figure the neural network is composed of **layers**:
The goal of this exercise is to understand how to combine three neurons to form a neural network. A neural network is nothing more than neurons connected together. As shown in the figure the neural network is composed of **layers**:
- Input layer: it only represents input data. **It doesn't contain neurons**.
- Output layer: it represents the last layer. It contains a neuron (in some cases more than 1).
@ -99,7 +99,7 @@ The goal of this exercice is to understand how to combine three neurons to form
Notice that the neuron **o1** in the output layer takes as input the output of the neurons **h1** and **h2** in the hidden layer.
In exercice 1, you implemented this neuron.
In exercise 1, you implemented this neuron.
![alt text][neuron]
[neuron]: images/day1/ex2/w3_day1_neuron.png "Plot"
@ -143,9 +143,9 @@ Now, we add two more neurons:
1. This question is validated if the output is: **0.9524917424084265**
# Exercice 3 Log loss
# Exercise 3 Log loss
The goal of this exercice is to implement the Log loss function. As mentioned last week, this function is used in classification as a **loss function**. It means that the better the classifier is, the smaller the loss function is. W2D1, you implemented the gradient descent on the MSE loss to update the weights of the linear regression. Similarly, the minimization of the Log loss leads to finding optimal weights.
The goal of this exercise is to implement the Log loss function. As mentioned last week, this function is used in classification as a **loss function**. It means that the better the classifier is, the smaller the loss function is. In W2D1, you implemented gradient descent on the MSE loss to update the weights of the linear regression. Similarly, the minimization of the Log loss leads to finding optimal weights.
Log loss: - 1/n * Sum[(y_true*log(y_pred) + (1-y_true)*log(1-y_pred))]
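A minimal sketch implementing exactly that formula, on hypothetical values:

```
import numpy as np

def log_loss(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(log_loss([1, 0, 1], [0.9, 0.2, 0.7]))   # hypothetical labels and probabilities
```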
@ -163,7 +163,7 @@ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
1. This question is validated if the output is: **0.5472899351247816**.
# Exercice 4 Forward propagation
# Exercise 4 Forward propagation
The goal of this exercise is to compute the log loss on the output of the forward propagation. The data used is the tiny data set below.
@ -198,9 +198,9 @@ The goal if the network is to predict the success at the exam given math and che
2. This question is validated if the logloss for the 4 students is **0.5485133607757963**.
# Exercice 5 Regression
# Exercise 5 Regression
The goal of this exercice is to learn to adapt the output layer to regression.
The goal of this exercise is to learn to adapt the output layer to regression.
As a reminder, one of the reasons the sigmoid is used in classification is that it squashes the output to between 0 and 1, which is the expected output range for a probability (W2D2: Logistic regression). However, the output of the regression is not a probability.
In order to perform a regression using a neural network, the activation function of the neuron on the output layer has to be modified to the **identity function**. In mathematics, the identity function is: **f(x) = x**. In other words, it returns the input as is. The three steps become:
@ -218,7 +218,7 @@ In order to perform a regression using a neural network, the activation function
All the other neurons' activation functions **don't change**.
1. Adapt the neuron class implemented in exercice 1. It now takes as a parameter `regression` which is boolean. When its value is `True`, `feedforward` should use the identity function as activation function instead of the sigmoid function.
1. Adapt the neuron class implemented in exercise 1. It now takes a boolean parameter `regression`. When its value is `True`, `feedforward` should use the identity function as activation function instead of the sigmoid function.
```
