
Merge branch 'master' into day-02-testing

pull/2/head
brad-gh committed 3 years ago via GitHub
commit 2c1ca93a01
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1600  one_md_per_day_format/piscine/Week1/data/D01/ex8/winequality-red.csv
72    one_md_per_day_format/piscine/Week1/data/D01/ex8/winequality.names
0     one_md_per_day_format/piscine/Week1/data/D01/ex9/model_forecasts.txt
321   one_md_per_day_format/piscine/Week1/day1.md
5     one_md_per_day_format/piscine/Week1/day2.md
241   one_md_per_day_format/piscine/Week1/day3.md
159   one_md_per_day_format/piscine/Week1/day4.md
20    one_md_per_day_format/piscine/Week1/day5.md
32    one_md_per_day_format/piscine/Week2/day03.md
22    one_md_per_day_format/piscine/Week2/day05.md
361   one_md_per_day_format/piscine/Week2/day1.md
28    one_md_per_day_format/piscine/Week2/day2.md
28    one_md_per_day_format/piscine/Week2/day4.md
10    one_md_per_day_format/piscine/Week2/template.md
10    one_md_per_day_format/piscine/Week3/template.md
18    one_md_per_day_format/piscine/Week3/w3day02.md
22    one_md_per_day_format/piscine/Week3/w3day03.md
40    one_md_per_day_format/piscine/Week3/w3day04.md
24    one_md_per_day_format/piscine/Week3/w3day05.md
20    one_md_per_day_format/piscine/Week3/w3day1.md

1600
one_md_per_day_format/piscine/Week1/data/D01/ex8/winequality-red.csv

File diff suppressed because it is too large.

72
one_md_per_day_format/piscine/Week1/data/D01/ex8/winequality.names

@ -0,0 +1,72 @@
Citation Request:
This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
1. Title: Wine Quality
2. Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
3. Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. pH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
these datasets under a regression approach. The support vector machine model achieved the
best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
analysis procedure).
4. Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are much more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.
5. Number of Instances: red wine - 1599; white wine - 4898.
6. Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
feature selection.
7. Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
8. Missing Attribute Values: None

0
one_md_per_day_format/piscine/Week1/data/D01/ex6/model_forecasts.txt → one_md_per_day_format/piscine/Week1/data/D01/ex9/model_forecasts.txt

321
one_md_per_day_format/piscine/Week1/day1.md

@ -1,38 +1,43 @@
# D01 Piscine AI - Data Science
The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
Version of NumPy I used to do the exercises: 1.18.1
I suggest using the most recent one.
Author:
<div style="page-break-after: always"></div>
# Outline: (optional)
A. Introduction
B. Rules
C. Exercises
## Rules
... Notebook Colabs or Jupyter Notebook
Save one notebook per day or one per exercise. Use markdown to divide your notebook into different exercises.
## Resources
- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/
# Exercise 1 Your first NumPy array
The goal of this exercise is to use many Python data types in **NumPy** arrays. **NumPy** arrays are intensively used in **NumPy** and **Pandas**. They are flexible and allow the use of optimized underlying **NumPy** functions.
1. Create a NumPy array that contains: an integer, a float, a string, a dictionary, a list, a tuple, a set and a boolean.
The expected output is:
```python
for i in your_np_array:
    print(type(i))
@ -44,13 +49,13 @@ for i in your_np_array:
<class 'tuple'>
<class 'set'>
<class 'bool'>
```
## Correction
1. This question is validated if `your_numpy_array` is a NumPy array. It can be checked with `type(your_numpy_array)`, which should be equal to `numpy.ndarray`, and if the types of its elements are as follows:
```python
for i in your_np_array:
    print(type(i))
@ -62,27 +67,30 @@ for i in your_np_array:
<class 'tuple'>
<class 'set'>
<class 'bool'>
```
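A minimal sketch that passes these checks; the element values are free to choose, and `dtype=object` is what lets mixed Python types coexist in one array:

```python
import numpy as np

# Mixed Python types force NumPy to store references, hence dtype=object.
your_np_array = np.array(
    [1, 1.5, 'three', {'four': 4}, [5], (6,), {7}, True],
    dtype=object)

for i in your_np_array:
    print(type(i))
```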
---
# Exercise 2 Zeros
The goal of this exercise is to learn to create a NumPy array with 0s.
1. Create a NumPy array of dimension **300** with zeros, without filling it manually
2. Reshape it to **(3,100)**
## Correction
1. The question is validated if the solution uses `np.zeros` and if the shape of the array is `(300,)`
2. The question is validated if the solution uses `reshape` and the shape of the array is `(3, 100)`
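A minimal sketch covering both questions:

```python
import numpy as np

zeros = np.zeros(300)             # question 1: shape (300,)
reshaped = zeros.reshape(3, 100)  # question 2: shape (3, 100)
print(zeros.shape, reshaped.shape)
```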
---
# Exercise 3 Slicing
The goal of this exercise is to learn NumPy indexing/slicing. It allows you to access values of a NumPy array efficiently and without a for loop.
1. Create a NumPy array of dimension 1 that contains all integers from 1 to 100 ordered.
2. Without using a for loop and using the array created in Q1, create an array that contains all odd integers. The expected output is: `np.array([1,3,...,99])`. *Hint*: it takes one line
3. Without using a for loop and using the array created in Q1, create an array that contains all even integers reversed. The expected output is: `np.array([100,98,...,2])`. *Hint*: it takes one line
@ -90,47 +98,50 @@ The goal of this exercice is to learn NumPy indexing/slicing. It allows to acces
## Correction
1. This question is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100 and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`.
2. This question is validated if the solution is: `integers[::2]`
3. This question is validated if the solution is: `integers[::-2]`
4. This question is validated if the array is: `np.array([1,0,3,4,0,...,0,99,100])`. There are at least two ways to get this result without a for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array:
```python
mask = (integers+1)%3 == 0
integers[mask] = 0
```
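The four answers put together in one sketch:

```python
import numpy as np

integers = np.arange(1, 101)     # question 1: 1, ..., 100
odds = integers[::2]             # question 2: 1, 3, ..., 99
evens_reversed = integers[::-2]  # question 3: 100, 98, ..., 2
integers[1::3] = 0               # question 4: zeroes 2, 5, ..., 98
```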
---
# Exercise 4 Random
The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons:
lack of real data, creating a random benchmark, using varied data sets.
NumPy proposes a lot of options to generate random data. In statistics, assumptions are made on the distribution the data is from. All data distributions that can be generated randomly are described in the documentation. In this exercise we will focus on two distributions:
- Uniform: For example, if your goal is to generate a random number from 1 to 100 with equal probability for each number, you'll need the uniform distribution. NumPy provides `randint` and `uniform` to generate uniform distributions
- Normal: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls** it can be done using the normal distribution. In that case, we need two parameters: the mean (1m51) and the standard deviation (0.0741m). NumPy provides `randn` to generate normal distributions (among others)
https://numpy.org/doc/stable/reference/random/generator.html
1. Set the seed to 888
2. Generate a **one-dimensional** array of size 100 with a normal distribution
3. Generate a **two-dimensional** array of size 8,8 with random integers from 1 to 10 - both included (same probability for each integer)
4. Generate a **three-dimensional** array of size 4,2,5 with random integers from 1 to 17 - both included (same probability for each integer)
## Correction
For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
1. The solution is accepted if the solution is: `np.random.seed(888)`
2. The solution is accepted if the solution is `np.random.randn(100)`. The value of the first element is `0.17620087373662233`.
3. The solution is accepted if the solution is `np.random.randint(1,11,(8,8))`.
```console
Given the NumPy version and the seed, you should have this output:
array([[ 7, 4, 8, 10, 2, 1, 1, 10],
@ -141,10 +152,11 @@ For this exercice, as the results may change depending on the version of the pac
[ 4, 1, 9, 7, 1, 4, 3, 5],
[ 3, 2, 10, 8, 6, 3, 9, 4],
[ 4, 4, 9, 2, 8, 5, 9, 5]])
```
4. The solution is accepted if the solution is `np.random.randint(1,18,(4,2,5))`.
```console
Given the NumPy version and the seed, you should have this output:
array([[[14, 16, 8, 15, 14],
@ -158,52 +170,58 @@ For this exercice, as the results may change depending on the version of the pac
[[ 3, 10, 5, 16, 13],
[17, 12, 9, 7, 16]]])
```
---
# Exercise 5: Split, concatenate, reshape arrays
The goal of this exercise is to learn to concatenate and reshape arrays.
1. Generate an array with integers from 1 to 50: `array([1,...,50])`
2. Generate an array with integers from 51 to 100: `array([51,...,100])`
3. Using `np.concatenate`, concatenate the two arrays into: `array([1,...,100])`
4. Reshape the previous array into:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
```
## Correction
1. This question is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 50 is part of the array.
2. This question is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 100 is part of the array.
3. This question is validated if you concatenated this way: `np.concatenate((array1, array2))`.
4. This question is validated if the result is:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
```
The easiest way is to use `array.reshape(10,10)`.
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of NumPy Arrays)
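A sketch of the full sequence, assuming `np.arange` is used for the first two questions:

```python
import numpy as np

first = np.arange(1, 51)                    # question 1
second = np.arange(51, 101)                 # question 2
combined = np.concatenate((first, second))  # question 3: note the tuple argument
grid = combined.reshape(10, 10)             # question 4
```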
---
# Exercise 6: Broadcasting and Slicing
The goal of this exercise is to learn to access values of n-dimensional arrays efficiently.
1. Create a 2-dimensional array of size 9,9 filled with 1s. Each value has to be an `int8`.
2. Using **slicing**, output this array:
```python
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
@ -215,16 +233,16 @@ The goal of this exercice is to learn to access values of n-dimensional arrays a
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: Computation on Arrays: Broadcasting)
## Correction
1. The question is validated if the output is the same as:
`np.ones([9,9], dtype=np.int8)`
2. The question is validated if the output is
```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
@ -235,96 +253,108 @@ https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
The solution is not accepted if the values of the array have been changed one by one manually. The usage of a for loop is not allowed either.
Here is an example of a possible solution:
```python
x = np.ones([9,9], dtype=np.int8)
x[1:8,1:8] = 0
x[2:7,2:7] = 1
x[3:6,3:6] = 0
x[4,4] = 1
```
---
# Exercise 7: NaN
The goal of this exercise is to learn to deal with missing data in NumPy and to manipulate NumPy arrays.
Let us consider a 2-dimensional array that contains the grades at the past two exams. Some of the students missed the first exam. As the grade is missing it has been replaced with a `NaN`.
1. Using `np.where` create a third column that is equal to the grade of the first exam if it exists and the second one otherwise. Add the column as the third column of the array.
**Using a for loop or if/else statement is not allowed in this exercise.**
```python
import numpy as np
generator = np.random.default_rng(123)
grades = np.round(generator.uniform(low = 0.0, high = 10.0, size = (10, 2)))
grades[[1,2,5,7], [0,0,0,0]] = np.nan
print(grades)
```
## Correction
1. There are two steps in this exercise:
- Create the vector that contains the grade of the first exam if available or the second. This can be done using `np.where`:
```python
np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
```
- Add this vector as the third column of the array. Here are two ways:
```python
np.insert(arr = grades, values = new_vector, axis = 1, obj = 2)
np.hstack((grades, new_vector[:, None]))
```
This question is validated if, without having used a for loop or having filled the array manually, the output is:
```console
[[ 7. 1. 7.]
[nan 2. 2.]
[nan 8. 8.]
[ 9. 3. 9.]
[ 8. 9. 8.]
[nan 2. 2.]
[ 8. 2. 8.]
[nan 6. 6.]
[ 9. 2. 9.]
[ 8. 5. 8.]]
```
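A sketch tying the two steps together, assuming `grades` was generated as in the exercise:

```python
# new_vector holds the first grade when available, the second one otherwise.
new_vector = np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
result = np.hstack((grades, new_vector[:, None]))
print(result)
```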
https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-NumPy-arrays.html
---
# Exercise 8 Wine
The goal of this exercise is to learn to perform a basic data analysis on real data using NumPy.
The data set that will be used for this exercise is the red wine data set.
https://archive.ics.uci.edu/ml/datasets/wine+quality
How to tell if a given 2D array has null columns?
1. Using `genfromtxt` load the data and reduce the size of the NumPy array by optimizing the types. The sum of absolute differences between the original data set and the "memory" optimized one has to be smaller than `1e-3`. I suggest using `np.float32`. Check that the NumPy array weighs **76800 bytes**.
2. Print the 2nd, 7th and 12th rows as a two-dimensional array
3. Is there any wine with a percentage of alcohol greater than 20%? Return True or False
4. What is the average percentage of alcohol for all wines in the data set? If needed, drop `np.nan` values
5. Compute the minimum, the maximum, the 25th percentile, the 50th percentile (median) and the 75th percentile of the pH
6. Compute the average quality of the wines having the 20% least sulphates
7. Compute the mean of all variables for wines having the best quality. Same question for the wines having the worst quality
## Correction
1. This question is validated if the text file has successfully been loaded in a NumPy array with
`genfromtxt('winequality-red.csv', delimiter=',')` and the reduced array weighs **76800 bytes**
2. This question is validated if the output is
```python
array([[ 7.4 , 0.7 , 0. , 1.9 , 0.076 , 11. , 34. ,
0.9978, 3.51 , 0.56 , 9.4 , 5. ],
[ 7.4 , 0.66 , 0. , 1.8 , 0.075 , 13. , 40. ,
@ -332,15 +362,16 @@ How to tell if a given 2D array has null columns?
[ 6.7 , 0.58 , 0.08 , 1.8 , 0.097 , 15. , 65. ,
0.9959, 3.28 , 0.54 , 9.2 , 5. ]])
```
This slicing gives the answer: `my_data[[1,6,11],:]`.
3. This question is validated if the answer is False. There are many ways to get the answer: find the maximum or check values greater than 20.
4. This question is validated if the answer is 10.422983114446529.
5. This question is validated if the answer is:
```console
pH stats
25 percentile: 3.21
50 percentile: 3.31
@ -349,58 +380,62 @@ How to tell if a given 2D array has null columns?
min: 2.74
max: 4.01
```
> *Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`.*
6. This question is validated if the answer is ~`5.2`. The first step is to get the 20th percentile of the column `sulphates`, then create a boolean array that contains `True` if the value is smaller than the 20th percentile, then select these rows with the column quality and compute the `mean`.
7. This question is validated if the output for the best wines is:
```python
array([ 8.56666667, 0.42333333, 0.39111111, 2.57777778, 0.06844444,
13.27777778, 33.44444444, 0.99521222, 3.26722222, 0.76777778,
12.09444444, 8. ])
```
And the output for the bad wines is:
```python
array([ 8.36 , 0.8845 , 0.171 , 2.635 , 0.1225 , 11. ,
24.9 , 0.997464, 3.398 , 0.57 , 9.955 , 3. ])
```
This can be done in three steps: get the max, create a boolean mask that indicates rows with max quality, use this mask to subset the rows with the best quality and compute the mean on axis 0.
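A hedged sketch of questions 3 to 7; the column indices assume the attribute order listed in `winequality.names` (pH is column 8, sulphates column 9, alcohol column 10, quality column 11) and a NumPy array `my_data` loaded as in question 1:

```python
import numpy as np

alcohol = my_data[:, 10]
print(np.nanmax(alcohol) > 20)   # question 3: False
print(np.nanmean(alcohol))       # question 4: average % of alcohol

# question 5: pH statistics (min, 25th, 50th, 75th percentiles, max)
print(np.nanpercentile(my_data[:, 8], [0, 25, 50, 75, 100]))

# question 6: average quality of the 20% least sulphated wines
sulphates = my_data[:, 9]
low_sulphates = sulphates <= np.nanpercentile(sulphates, 20)
print(my_data[low_sulphates, 11].mean())

# question 7: mean of all variables for the best-quality wines
quality = my_data[:, 11]
best = my_data[quality == np.nanmax(quality)]
print(best.mean(axis=0))
```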
---
## Exercise 9 Football tournament
The goal of this exercise is to learn to use permutations and more complex array manipulations.
A Football tournament is organized in your city. There are 10 teams and the director of the tournament wants you to create a first round as exciting as possible. To do so, you are allowed to choose the pairs. As a former data scientist, you implemented a model based on teams' current season performance. This model predicts the score difference between two teams. You used this algorithm to predict the score difference for every possible pair.
The matrix returned is a 2-dimensional array that contains in (i,j) the score difference between team i and j. The matrix is in `model_forecasts.txt`.
Using this output, what are the pairs that will give the most interesting matches?
If a team wins 7-1 the match is obviously less exciting than a match where the winner wins 2-1.
The criterion that corresponds to **the pairs that will give the most interesting matches** is **the pairs that minimize the sum of squared differences**
The expected output is:
```console
[[m1_t1 m2_t1 m3_t1 m4_t1 m5_t1]
[m1_t2 m2_t2 m3_t2 m4_t2 m5_t2]]
```
- m1_t1 stands for match1_team1
- m1_t1 plays against m1_t2 ...
**Usage of a for loop is not allowed; you may need to use the** `itertools` **library to create permutations**
https://docs.python.org/3.9/library/itertools.html
## Correction
This exercise is validated if the output is:
```console
[[0 3 1 2 4]
[7 6 8 9 5]]
```
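A brute-force sketch: it assumes `model_forecasts.txt` parses with `np.loadtxt`, and it enumerates every permutation of the 10 team ids, read as 5 consecutive pairs (loop-free but memory-hungry):

```python
import itertools
import numpy as np

scores = np.loadtxt('model_forecasts.txt')  # assumed 10x10 score-difference matrix

# All permutations of the 10 team ids, viewed as 5 consecutive pairs.
perms = np.array(list(itertools.permutations(range(10))))
pairs = perms.reshape(-1, 5, 2)

# Sum of squared predicted score differences for each candidate round.
diffs = scores[pairs[:, :, 0], pairs[:, :, 1]]
total = (diffs ** 2).sum(axis=1)

best = pairs[total.argmin()]
print(best.T)  # row 0: m1_t1 ... m5_t1, row 1: m1_t2 ... m5_t2
```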

5
one_md_per_day_format/piscine/Week1/day2.md

@ -80,9 +80,10 @@ and if the types of the first value of the columns are
<class 'float'>
```
# Exercise 2 **Electric power consumption**
The goal of this exercise is to learn to manipulate real data with Pandas.
The data set used is **Individual household electric power consumption**
1. Delete the columns `Time`, `Sub_metering_2` and `Sub_metering_3`
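A one-line sketch for question 1; the file name and separator of the UCI download are assumptions here:

```python
import pandas as pd

df = pd.read_csv('household_power_consumption.txt', sep=';')  # hypothetical path
df = df.drop(columns=['Time', 'Sub_metering_2', 'Sub_metering_3'])
```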
@ -223,7 +224,7 @@ Questions:
## Correction
To validate this exercise all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas.
1. How many rows and columns are there? **10000 entries** and **14 columns**

241
one_md_per_day_format/piscine/Week1/day3.md

@ -1,45 +1,42 @@
# D03 Piscine AI - Data Science
Author:
# Introduction
While working on a dataset it is important to check the distribution of the data. Obviously, for most humans it is difficult to visualize data in more than 3 dimensions.
Viz is important to understand the data and to show results. We have already seen there are some basic viz functionalities in Pandas.
Now we'll discover some of the most known viz libraries in Python:
- Pandas viz
- Matplotlib
- Plotly
Pandas viz is practical: quick plots, and it relies on Matplotlib (check the Matplotlib doc; sometimes not all params are detailed in the Pandas doc).
For more elaborate plots Matplotlib is necessary
And finally Plotly is an interactive plot library.
## Rules
Always a title, legend, ...
## Resources
https://matplotlib.org/3.3.3/tutorials/index.html
https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596
https://github.com/rougier/matplotlib-tutorial
https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html
# Exercise 1 Pandas plot 1
The goal of this exercise is to learn to create plots with Pandas. Pandas' `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
Here is the data we will be using:
```python
df = pd.DataFrame({
'name':['christopher','marion','maria','mia','clement','randy','remi'],
'age':[70,30,22,19,45,33,20],
@ -49,137 +46,134 @@ Here is the data we will be using:
'num_pets':[5,1,0,5,2,2,3]
})
```
1. Reproduce this plot. This plot is called a bar plot
![alt text][logo]
[logo]: images/day03/w1day03_ex1_plot1.png "Bar plot ex1"
The plot has to contain:
- the title
- name on x-axis
- legend
## Correction
1. This question is validated if the plot reproduces the plot in the image. It has to contain a title, an x-axis name and a legend.
![alt text][logo]
[logo]: images/day03/w1day03_ex1_plot1.png "Bar plot ex1"
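A possible answer using the `df` defined above; which columns the image actually shows is an assumption here:

```python
# Hypothetical choice of columns: children and pets per person.
df.plot(kind='bar', x='name', y=['num_children', 'num_pets'],
        title='Children and pets per person')
```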
## Exercise 2: Pandas plot 2
The goal of this exercise is to learn to create plots with Pandas. Pandas' `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
```python
df = pd.DataFrame({
'name':['christopher','marion','maria','mia','clement','randy','remi'],
'age':[70,30,22,19,45,33,20],
'gender':['M','F','F','F','M','M','M'],
'state':['california','dc','california','dc','california','new york','porto'],
'num_children':[4,2,1,0,3,1,0],
'num_pets':[5,1,0,2,2,2,3]
})
```
1. Reproduce this plot. This plot is called a scatter plot. Do you observe a relationship between the age and the number of children?
![alt text][logo_ex2]
[logo_ex2]: images/day03/w1day03_ex2_plot1.png "Scatter plot ex2"
The plot has to contain:
- the title
- name on x-axis
- name on y-axis
## Correction
1. This question is validated if the plot reproduces the plot in the image. It has to contain a title, an x-axis name and a y-axis name.
You should also observe that the older people are, the more children they have.
![alt text][logo_ex2]
[logo_ex2]: images/day03/w1day03_ex2_plot1.png "Scatter plot ex2"
## Exercise 3 Matplotlib 1
The goal of this plot is to learn to use Matplotlib to plot data. As you know, Matplotlib is the underlying library used by Pandas. It provides more options to plot custom visualizations. However, most of the plots we will create with Matplotlib can be reproduced with Pandas' `.plot()`.
1. Reproduce this plot. We assume the data points have integer coordinates.
![alt text][logo_ex3]
[logo_ex3]: images/day03/w1day03_ex3_plot1.png "Scatter plot ex3"
The plot has to contain:
- the title
- name on x-axis and y-axis
- x-axis and y-axis are limited to [1,8]
- **style**:
- red dashdot line with a width of 3
- blue circles with a size of 12
## Correction
1. This question is validated if the plot reproduces the plot in the image and respects those criteria:
- the title
- name on x-axis and y-axis
- x-axis and y-axis are limited to [1,8]
- **style**:
- red dashdot line with a width of 3
- blue circles with a size of 12
![alt text][logo_ex3]
[logo_ex3]: images/day03/w1day03_ex3_plot1.png "Scatter plot ex3"
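A sketch that satisfies the listed criteria; the data points themselves are an assumption:

```python
import matplotlib.pyplot as plt

x = [2, 3, 4, 5, 6, 7]  # hypothetical integer coordinates
y = [2, 3, 4, 5, 6, 7]

plt.plot(x, y, 'r-.', linewidth=3)   # red dashdot line, width 3
plt.plot(x, y, 'bo', markersize=12)  # blue circles, size 12
plt.xlim(1, 8)
plt.ylim(1, 8)
plt.title('Exercise 3')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```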
# Exercise 4 Matplotlib 2
The goal of this plot is to learn to use Matplotlib to plot different lines in the same plot on different axes using `twinx`. This is very useful to compare variables in different ranges.
Here is the data:
```python
left_data = [5, 7, 11, 13, 17]
right_data = [0.1, 0.2, 0.4, 0.8, -1.6]
x_axis = [0.0, 1.0, 2.0, 3.0, 4.0]
```
1. Reproduce this plot
![alt text][logo_ex4]
[logo_ex4]: images/day03/w1day03_ex4_plot1.png "Twin axis plot ex4"
The plot has to contain:
- the title
- name on left y-axis and right y-axis
- **style**:
- left data in black
- right data in red
## Correction
1. This question is validated if the plot reproduces the plot in the image and respects those criteria:
The plot has to contain:
- the title
- name on left y-axis and right y-axis
- **style**:
- left data in black
- right data in red
![alt text][logo_ex4]
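A sketch with the given data; titles and axis names are free to choose:

```python
import matplotlib.pyplot as plt

left_data = [5, 7, 11, 13, 17]
right_data = [0.1, 0.2, 0.4, 0.8, -1.6]
x_axis = [0.0, 1.0, 2.0, 3.0, 4.0]

fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()  # second y-axis sharing the same x-axis
ax_left.plot(x_axis, left_data, color='black')
ax_right.plot(x_axis, right_data, color='red')
ax_left.set_ylabel('left data')
ax_right.set_ylabel('right data')
ax_left.set_title('Twin axes')
plt.show()
```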
@ -187,53 +181,55 @@ The plot has to contain:
https://matplotlib.org/gallery/api/two_scales.html
# Exercise 5 Matplotlib subplots
The goal of this exercise is to learn to use Matplotlib to create subplots.
1. Reproduce this plot using a **for loop**:
![alt text][logo_ex5]
[logo_ex5]: images/day03/w1day03_ex5_plot1.png "Subplots ex5"
The plot has to contain:
- 6 subplots: 2 rows, 3 columns
- Keep space between plots: `hspace=0.5` and `wspace=0.5`
- Each plot contains
- Text (2,3,i) centered at 0.5, 0.5. *Hint*: check the parameter `ha` of `text`
- a title: Title i
- a title: Title i
1. The question is validated if the plot reproduces the image and the given criterias:
1. The question is validated if the plot reproduces the image and the given criteria:
The plot has to contain:
- Keep space between plots: `hspace=0.5` and `wspace=0.5`
- Each plot contains
- Text (2,3,i) centered at 0.5, 0.5. *Hint*: check the parameter `ha` of `text`
- a title: Title i
![alt text][logo_ex5]
[logo_ex5]: images/day03/w1day03_ex5_plot1.png "Subplots ex5"
Check that the plot has been created with a for loop.
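One possible for-loop construction:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3)
plt.subplots_adjust(hspace=0.5, wspace=0.5)
for i, ax in enumerate(axes.flatten(), start=1):
    ax.text(0.5, 0.5, f'(2,3,{i})', ha='center')
    ax.set_title(f'Title {i}')
plt.show()
```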
# Exercise 6 Plotly 1
Plotly has evolved a lot in the previous years. It is important to **always check the documentation**.
Plotly comes with a high-level interface: Plotly Express. It helps build some complex plots easily. The lesson won't detail the complex examples. Plotly Express is quite interesting when using Pandas DataFrames because there are some built-in functions that leverage Pandas DataFrames.
The plot outputted by Plotly is interactive and can also be dynamic.
The goal of the exercise is to plot the price of a company. Its price is generated below.
```python
returns = np.random.randn(50)
price = 100 + np.cumsum(returns)
@ -242,41 +238,41 @@ df = pd.DataFrame(zip(dates, price),
columns=['Date','Company_A'])
```
1. Using Plotly Express, reproduce the plot in the image. As the data is generated randomly I do not expect you to reproduce the same line.
![alt text][logo_ex6]
[logo_ex6]: images/day03/w1day03_ex6_plot1.png "Time series ex6"
The plot has to contain:
- title
- x-axis name
- y-axis name
2. Same question but now using `plotly.graph_objects`. You may need to use `init_notebook_mode` from `plotly.offline`.
https://plotly.com/python/time-series/
## Correction
1. This question is validated if the plot in the image is reproduced using Plotly Express, given those criteria:
The plot has to contain:
- a title
- x-axis name
- y-axis name
![alt text][logo_ex6]
[logo_ex6]: images/day03/w1day03_ex6_plot1.png "Time series ex6"
2. This question is validated if the plot in the image is reproduced using `plotly.graph_objects`, given those criteria:
The plot has to contain:
- a title
- x-axis name
- y-axis name
@ -284,34 +280,36 @@ The plot has to contain:
[logo_ex6]: images/day03/w1day03_ex6_plot1.png "Time series ex6"
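A sketch for question 1, reusing the generated data; the dates and titles are assumptions:

```python
import numpy as np
import pandas as pd
import plotly.express as px

returns = np.random.randn(50)
price = 100 + np.cumsum(returns)
dates = pd.date_range(start='2021-01-01', periods=50, freq='B')  # hypothetical dates

df = pd.DataFrame(zip(dates, price), columns=['Date', 'Company_A'])
fig = px.line(df, x='Date', y='Company_A', title='Company A price')
fig.show()
```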
# Exercise 7 Plotly Box plots
The goal of this exercise is to learn to use Plotly to plot box plots. A box plot is a method for graphically depicting groups of numerical data through their quartiles and values such as min and max. It allows you to compare some variables quickly.
Let us generate 3 random arrays from a normal distribution, and to the last two arrays add respectively 1 and 2.
```python
y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1 # shift mean
y2 = np.random.randn(50) + 2
```
1. Plot in the same Figure 2 box plots as shown in the image. In this exercise the style is not important.
![alt text][logo_ex7]
[logo_ex7]: images/day03/w1day03_ex7_plot1.png "Box plot ex7"
The plot has to contain:
- the title
- the legend
https://plotly.com/python/box-plots/
## Correction
1. This question is validated if the plot in the image is reproduced, given those criteria:
The plot has to contain:
- the title
- the legend
@ -320,20 +318,19 @@ The plot has to contain:
[logo_ex7]: images/day03/w1day03_ex7_plot1.png "Box plot ex7"
```python
import plotly.graph_objects as go
import numpy as np

y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1 # shift mean
y2 = np.random.randn(50) + 2

fig = go.Figure()
fig.add_trace(go.Box(y=y0, name='Sample A',
                     marker_color = 'indianred'))
fig.add_trace(go.Box(y=y1, name = 'Sample B',
                     marker_color = 'lightseagreen'))
fig.show()
```

159
one_md_per_day_format/piscine/Week1/day4.md

@ -1,48 +1,44 @@
# D04 Piscine AI - Data Science
Author:
# Table of Contents:
Historical part:
Data wrangling, unify source of data ...
# Introduction
...
## Resources
Pandas website
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
# Exercise 1 Concatenate
The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for Series.
Here are the two DataFrames to concatenate:
```python
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]],
columns=['letter', 'number'])
```
1. Concatenate these two DataFrames on the index axis and reset the index. The index of the outputted DataFrame should be `RangeIndex(start=0, stop=4, step=1)`. **Do not change the index manually**.
## Correction
1. This question is validated if the outputted DataFrame is:
@ -54,15 +50,14 @@ df2 = pd.DataFrame([['c', 1], ['d', 2]],
| 2 | c | 1 |
| 3 | d | 2 |
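A possible answer, going through `reset_index(drop=True)`:

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]], columns=['letter', 'number'])

# Concatenate on the index axis, then rebuild a clean RangeIndex.
result = pd.concat([df1, df2]).reset_index(drop=True)
print(result.index)  # RangeIndex(start=0, stop=4, step=1)
```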
# Exercise 2 Merge
The goal of this exercise is to learn to merge DataFrames.
The logic of merging DataFrames in Pandas is quite similar to the one used in SQL.
Here are the two DataFrames to merge:
```python
#df1
df1_dict = {
@ -80,6 +75,7 @@ df2_dict = {
df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
```
1. Merge the two DataFrames to get this output:
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
@ -87,7 +83,7 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
2. Merge the two DataFrames to get this output:
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
|---:|-----:|:---------------|:---------------|:---------------|:---------------|
@ -100,16 +96,16 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |
## Correction
1. This question is validated if the output is:
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
|---:|-----:|:-------------|:-------------|:-------------|:-------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
2. This question is validated if the output is:
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
|---:|-----:|:---------------|:---------------|:---------------|:---------------|
@ -122,17 +118,16 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |
Note: Check that the suffixes are set using the `suffixes` parameter rather than manually changing the columns' names.
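A sketch of both merges, assuming `df1` and `df2` from the setup above:

```python
# Question 1: inner merge on id keeps only ids present in both (default _x/_y suffixes).
inner = df1.merge(df2, on='id', how='inner')

# Question 2: outer merge keeps all ids; custom suffixes name the sources.
outer = df1.merge(df2, on='id', how='outer', suffixes=('_df1', '_df2'))
```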
## Exercise 3 Merge MultiIndex
The goal of this exercise is to learn to merge DataFrames with MultiIndex.
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reason the Data Engineer lost the last 15 days of alternative data.
1. Using `market_data` as the reference, merge `alternative_data` on `market_data`
```python
#generate days
all_dates = pd.date_range('2021-01-01', '2021-12-15')
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
@ -152,20 +147,17 @@ Use the code below to generate the DataFrames. `market_data` contains fake marke
alternative_data = pd.DataFrame(index=index_alt,
data=np.random.randn(len(index_alt), 2),
columns=['Twitter','Reddit'])
```
`reset_index` is not allowed for this question
2. Fill missing values with 0
- https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
## Correction
1. This question is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a similar table:
| | Open | Close | Close_Adjusted | Twitter | Reddit |
|:-----------------------------------------------------|-----------:|----------:|-----------------:|------------:|----------:|
@ -175,23 +167,21 @@ https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 |
One of the answers that returns the correct DataFrame is:
`market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`
2. This question is validated if the number of missing values in the DataFrame is equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
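A sketch for question 2, assuming `merged_df` is the DataFrame produced in question 1:

```python
filled_df = merged_df.fillna(0)
assert filled_df.isna().sum().sum() == 0
```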
# Exercise 4 Groupby Apply
The goal of this exercise is to learn to group the data and apply a function on the groups.
The use case we will work on is computing winsorized values.
1. Create a function that uses `pandas.DataFrame.clip` and that replaces extreme values by a given percentile. The values that are greater than the upper percentile (80%) are replaced by the 80% percentile. The values that are smaller than the lower percentile (20%) are replaced by the 20% percentile. This process that corrects outliers is called **winsorizing**.
I recommend using NumPy to compute the percentiles to make sure we use the same default parameters.
```python
def winsorize(df, quantiles):
"""
df: pd.DataFrame
@ -201,15 +191,15 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
#TODO
return
```
Here is what the function should output:
```python
df = pd.DataFrame(range(1,11), columns=['sequence'])
print(winsorize(df, [0.20, 0.80]).to_markdown())
```
| | sequence |
|---:|-----------:|
| 0 | 2.8 |
@ -223,16 +213,16 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
| 8 | 8.2 |
| 9 | 8.2 |
2. Now we consider that each value belongs to a group. The goal is to apply the **winsorizing to each group**. In this question we use winsorizing values that are common: `[0.05,0.95]` as percentiles. Here is the new data set:
```python
groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4])
df = pd.DataFrame(data= zip(groups,
range(1,51)),
columns=["group", "sequence"])
```
The expected output (first rows) is:
| | sequence |
@ -249,19 +239,17 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
| 9 | 9.55 |
| 10 | 11.45 |
## Correction
The for loop is forbidden in this exercise. The goal is to use `groupby` and `apply`.
1. This question is validated if the output is:
```python
df = pd.DataFrame(range(1,11), columns=['sequence'])
print(winsorize(df, [0.20, 0.80]).to_markdown())
```
| | sequence |
|---:|-----------:|
| 0 | 2.8 |
@ -275,10 +263,9 @@ The for loop is forbidden in this exercice. The goal is to use `groupby` and `ap
| 8 | 8.2 |
| 9 | 8.2 |
2. This question is validated if the output is the same as the one returned by:
```python
def winsorize(df_series, quantiles):
"""
df: pd.DataFrame or pd.Series
@ -293,7 +280,8 @@ The for loop is forbidden in this exercice. The goal is to use `groupby` and `ap
df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95])
```
The output can also be a Series instead of a DataFrame.
The expected output (first rows) is:
@ -309,15 +297,13 @@ The for loop is forbidden in this exercice. The goal is to use `groupby` and `ap
| 7 | 8 |
| 8 | 9 |
| 9 | 9.55 |
| 10 | 11.45 |
- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e
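A possible full implementation of `winsorize`, assuming NumPy quantiles as recommended above; the grouped version then follows directly:

```python
import numpy as np
import pandas as pd

def winsorize(df, quantiles):
    """Clip values of df to the given [lower, upper] quantiles."""
    lower, upper = np.quantile(df.values, quantiles)
    return df.clip(lower=lower, upper=upper)

# Question 1: first value becomes 2.8, last value 8.2.
df = pd.DataFrame(range(1, 11), columns=['sequence'])
print(winsorize(df, [0.20, 0.80]))

# Question 2: winsorize within each group.
groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2,
                         np.ones(10)+3, np.ones(10)+4])
df2 = pd.DataFrame(data=zip(groups, range(1, 51)), columns=['group', 'sequence'])
print(df2.groupby('group')[['sequence']].apply(winsorize, [0.05, 0.95]))
```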
# Exercise 5 Groupby Agg
The goal of this exercise is to learn to compute different types of aggregations on the groups. This small DataFrame contains products and prices.
| | value | product |
|---:|--------:|:-------------|
@ -329,7 +315,7 @@ The goal of this exercice is to learn to compute different type of agregations o
| 5 | 100 | mobile phone |
| 6 | 99.99 | table |
1. Compute the min, max and mean price for each product in one single line of code. The expected output is:
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
|:-------------|-------------------:|-------------------:|--------------------:|
@ -341,7 +327,7 @@ Note: The columns don't have to be MultiIndex
## Correction
1. The question is validated if the output is:
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
|:-------------|-------------------:|-------------------:|--------------------:|
@ -353,12 +339,12 @@ Note: The columns don't have to be MultiIndex
My answer is: `df.groupby('product').agg({'value':['min','max','mean']})`
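To make the one-liner concrete, a self-contained sketch (the table above is truncated in the diff, so the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'value': [20.0, 25.0, 30.0, 200.0, 40.0, 100.0, 99.99],
                   'product': ['table', 'chair', 'chair', 'mobile phone',
                               'table', 'mobile phone', 'table']})

# one aggregation call computing several statistics per group
print(df.groupby('product').agg({'value': ['min', 'max', 'mean']}))
```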
# Exercise 6 Unstack

The goal of this exercise is to learn to unstack a MultiIndex.

Let's assume we trained a machine learning model that predicts a daily score for the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, etc.

```python
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
#generate tickers
@ -373,7 +359,8 @@ market_data = pd.DataFrame(index=index,
columns=['Prediction'])
```
1. Unstack the DataFrame.

The first 3 rows of the DataFrame should look like this:
@ -383,13 +370,11 @@ The first 3 rows of the DataFrame should like this:
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
2. Plot the 5 time series in the same plot using Pandas built-in visualization functions with a title.

## Correction

1. This question is validated if the output of `unstacked_df.head()` is similar to:
| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') |
|:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
@ -397,6 +382,4 @@ The first 3 rows of the DataFrame should like this:
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
2. The question is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.
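Since the generation snippet above is truncated by the diff, here is a hedged reconstruction of the whole flow (the random data and index names are assumptions):

```python
import numpy as np
import pandas as pd

business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
tickers = ['AAPL', 'AMZN', 'DAI', 'FB', 'GE']
index = pd.MultiIndex.from_product([business_dates, tickers],
                                   names=['Date', 'Ticker'])
market_data = pd.DataFrame(index=index,
                           data=np.random.randn(len(index), 1),
                           columns=['Prediction'])

unstacked_df = market_data.unstack()    # question 1: one column per ticker
unstacked_df.plot(title='Stocks 2021')  # question 2: the 5 series in one plot
```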
20
one_md_per_day_format/piscine/Week1/day5.md
@ -31,9 +31,9 @@ https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
# Exercise 1

The goal of this exercise is to learn to manipulate time series in Pandas.

1. Create a `Series` named `integer_series` from 1st January 2010 to 31 December 2020. Each date is associated with the number of days since 1st January 2010. It starts with 0.
@ -79,9 +79,9 @@ The goal of this exercice is to learn to manipulate time series in Pandas.
```
If the `NaN` values have been dropped the solution is also accepted. The solution uses `rolling().mean()`.
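A sketch of the two manipulations discussed above (the rolling window size is an assumption, since the question text is truncated in the diff):

```python
import pandas as pd

# one value per day from 2010-01-01 to 2020-12-31, counting days since the start
dates = pd.date_range('2010-01-01', '2020-12-31', freq='D')
integer_series = pd.Series(range(len(dates)), index=dates)

# smooth the series with a rolling mean; NaNs appear at the window's start
smoothed = integer_series.rolling(7).mean()
print(smoothed.head(10))
```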
# Exercise 2

The goal of this exercise is to learn to use Pandas on time series and on financial data.
The data we will use is Apple stock.
@ -144,11 +144,11 @@ To get this result there are two ways: `resample` and `groupby`. There are two k
Name: Open, Length: 10118, dtype: float64
```
- The first way to compute the return without a for loop is to use `pct_change`
- The second way to compute the return without a for loop is to implement the formula given in the exercise in a vectorized way. To get the value at `t-1` you can use `shift`
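A quick sketch of both ways, assuming `df` holds the Apple data with an `Open` column, as the expected output above suggests:

```python
# way 1: Pandas built-in percentage change
returns_pct = df['Open'].pct_change()

# way 2: vectorized formula (p_t - p_{t-1}) / p_{t-1}, using shift for t-1
returns_formula = (df['Open'] - df['Open'].shift(1)) / df['Open'].shift(1)
```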
# Exercise 3 Multi asset returns

The goal of this exercise is to learn to compute daily returns on a DataFrame that contains many assets (multi-assets).

```python
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
@ -187,9 +187,9 @@ Note: The data is generated randomly, the values you may have a different result
The DataFrame contains random data. Make sure your output and the one returned by this code are based on the same DataFrame.
# Exercise 4 Backtest

The goal of this exercise is to learn to perform a backtest in Pandas. A backtest is a tool that allows you to know how a strategy would have performed retrospectively using historical data. In this exercise we will focus on the backtesting tool and not on how to build the best strategy.
We will backtest a **long only** strategy on Apple Inc. Long only means that we only consider buying the stock. The input signal at date d says if the close price will increase at d+1. We assume that the input signal is available before the market closes.
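A minimal sketch of the long-only backtest logic (the column names `Close` and `Signal` are assumptions; the signal is 1 when the close is expected to increase at d+1, 0 otherwise):

```python
# future return known at date d: what the stock does between d and d+1;
# the last value is NaN by construction
daily_future_returns = df['Close'].shift(-1) / df['Close'] - 1

# long only: we hold the stock only on days where the signal says "up"
strategy_returns = daily_future_returns * df['Signal']
print(strategy_returns.sum())  # naive cumulative performance of the strategy
```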
@ -266,7 +266,7 @@ My results can be reproduced using: `np.random.seed = 2712`. Given the versions
Name: Daily_futur_returns, Length: 10118, dtype: float64
```
The answer is also accepted if the returns are computed as in exercise 2 and then shifted into the future using `shift`, but I do not recommend this implementation as it adds missing values!

An example of a solution is:
32
one_md_per_day_format/piscine/Week2/day03.md
@ -35,9 +35,9 @@ This object takes as input the preprocessing transforms and a Machine Learning m
## Resources
TODO
# Exercise 1 Imputer 1

The goal of this exercise is to learn how to use an Imputer to fill missing values on a basic example.

```python
train_data = [[7, 6, 5],
@ -84,11 +84,11 @@ test_data = [[np.nan, 1, 2],
[ 4., 2., 4.]])
```
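A hedged sketch of the expected usage with scikit-learn's `SimpleImputer` (mean imputation matches the visible expected output; the rows of `train_data` hidden by the diff are placeholders):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_data = [[7, 6, 5],
              [4, np.nan, 5],   # placeholder row, truncated in the diff
              [1, 20, 8]]       # placeholder row, truncated in the diff
test_data = [[np.nan, 1, 2],
             [7, np.nan, 9],
             [np.nan, 2, 4]]

imputer = SimpleImputer(strategy='mean')
imputer.fit(train_data)               # learn the column means on the train set
print(imputer.transform(test_data))   # missing test values become those means
```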
# Exercise 2 Scaler

The goal of this exercise is to learn to scale a data set. There are various scaling techniques; we will focus on `StandardScaler` from scikit-learn.

We will use a tiny data set for this exercise that we will generate ourselves:

```python
X_train = np.array([[ 1., -1., 2.],
@ -140,8 +140,8 @@ array([[ 1.22474487, -1.22474487, 0.53452248],
[ 0. , 1.22474487, 0.53452248]])
```
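A generic sketch of the scaling step (the full `X_train` of the exercise is truncated above, so the rows after the first one are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],   # made-up row
                    [ 0.,  1., -1.]])  # made-up row

scaler = StandardScaler().fit(X_train)  # learns per-column mean and std
print(scaler.transform(X_train))        # zero mean, unit variance per column
```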
# Exercise 3 One hot Encoder

The goal of this exercise is to learn how to deal with Categorical variables using the OneHot Encoder.

```python
X_train = [['Python'], ['Java'], ['Java'], ['C++']]
@ -199,8 +199,8 @@ https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEn
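A hedged sketch of the encoding (`sparse=False` is used for readability; in recent scikit-learn versions this parameter is named `sparse_output`):

```python
from sklearn.preprocessing import OneHotEncoder

X_train = [['Python'], ['Java'], ['Java'], ['C++']]

encoder = OneHotEncoder(sparse=False)
print(encoder.fit_transform(X_train))  # one binary column per language
print(encoder.categories_)             # column order: C++, Java, Python
```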
# Exercise 4 Ordinal Encoder

The goal of this exercise is to learn how to deal with Categorical variables using the Ordinal Encoder.
In that case, we want the model to consider that: **good > neutral > bad**
@ -242,9 +242,9 @@ array([[2.],
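A hedged sketch with an explicit category order (`X_train` is hypothetical; the `categories` argument enforces bad < neutral < good):

```python
from sklearn.preprocessing import OrdinalEncoder

X_train = [['good'], ['bad'], ['neutral']]  # hypothetical data

encoder = OrdinalEncoder(categories=[['bad', 'neutral', 'good']])
print(encoder.fit_transform(X_train))  # good -> 2., bad -> 0., neutral -> 1.
```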
# Exercise 5 Categorical variables

The goal of this exercise is to learn how to deal with Categorical variables with Ordinal Encoder, Label Encoder and OneHot Encoder.
Preliminary:
- Load the breast-cancer.csv file
@ -359,7 +359,7 @@ AttributeError: Transformer ordinalencoder (type OrdinalEncoder) does not provid
```
**It means that if you want to use the Ordinal Encoder, you will have to create a variable that contains the column names in the right order. This step is not required in that exercise.**
@ -438,9 +438,9 @@ array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 2., 2., 0.,
```
# Exercise 6 Pipeline

The goal of this exercise is to learn to use the Scikit-learn object: Pipeline. The data set used for this exercise is the `iris` data set.
Preliminary:
- Run the code below.
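The preliminary code is truncated in the diff; here is a hedged sketch of a comparable pipeline on `iris` (the chosen steps are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# preprocessing transforms chained with a final estimator
pipe = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                 ('scaler', StandardScaler()),
                 ('lr', LogisticRegression())])
pipe.fit(X, y)
print(pipe.score(X, y))
```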
@ -513,9 +513,9 @@ On financial data set, the ratio signal to noise is low. Trying to forecast stoc
# Exercise 1 Imputer 2

The goal of this exercise is to learn how to use an Imputer to fill missing values in the data set.
**Reminder**: The data exploration should be done first. It tells which rows/variables should be removed because there are too many missing values. Then the remaining data points can be treated using an Imputer.
22
one_md_per_day_format/piscine/Week2/day05.md
@ -6,12 +6,12 @@
# Introduction
If you finished yesterday's exercises, you should be able to train several Machine Learning algorithms and to choose the one returned by GridSearchCV.

GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the **cv** parameter to compute the GridSearch with a train set and a test set.

It means that the selected model is based on one single measure. What if, by luck, we predict correctly on that split? What if the best model is bad? What if I could have selected a better model?

We will answer these questions today! The topics we will cover are among the most important in Machine Learning.

Must read before starting the exercises:

- Bias-Variance trade-off, aka underfitting/overfitting.
- https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
@ -28,9 +28,9 @@ Must read before to start the exercices:
## Resources
# Exercise 1: K-Fold

The goal of this exercise is to learn to use `KFold` to split the data set in a k-fold cross validation. Most of the time you won't use this function to split your data, because it is used by others such as `cross_val_score`, `cross_validate` or `GridSearchCV`. But it allows you to understand the splitting and to create a custom one if needed.
```python
X = np.array(np.arange(1,21).reshape(10,-1))
@ -81,9 +81,9 @@ y = np.array(np.arange(1,11))
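A minimal sketch of the splitting (`n_splits` is an assumption):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(np.arange(1, 21).reshape(10, -1))
y = np.array(np.arange(1, 11))

kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```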
# Exercise 2: Cross validation (k-fold)

The goal of this exercise is to learn how to use cross validation. After reading the articles you should be able to explain why we need to cross-validate the models. We will first focus on Linear Regression to reduce the computation time. We will be using `cross_validate` to run the cross validation. Note that `cross_val_score` is similar, but `cross_validate` calculates one or more scores and timings for each CV split.
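A hedged sketch of `cross_validate` on a Linear Regression (the data set choice is an assumption):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)

cv_results = cross_validate(LinearRegression(), X, y, cv=5,
                            return_train_score=True)
print(cv_results['test_score'])  # one score per fold
print(cv_results['fit_time'])    # one timing per fold
```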
Preliminary:
@ -159,9 +159,9 @@ The model is consistent across folds: it is stable. That's a first sign that the
# Exercise 3 GridsearchCV

The goal of this exercise is to learn to use GridSearchCV to run a grid search, predict on the test set and score on the test set.
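A hedged sketch of the flow (estimator, grid and data set are assumptions):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

gs = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]}, cv=5)
gs.fit(X_train, y_train)

print(gs.best_params_)
print(gs.score(X_test, y_test))  # predict and score on the test set
```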
Preliminary:
@ -250,13 +250,13 @@ WARNING: If the score used in classification is the AUC, there is one rare case
# Exercise 5 Validation curve and Learning curve

The goal of this exercise is to learn to analyse the models' performance with two tools:
- Validation curve
- Learning curve
For this exercise we will use a dataset of 100k data points to give you an idea of the computation time you can expect during projects.
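A hedged sketch of both tools (estimator, parameter range and data are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100_000, random_state=0)

# validation curve: train/validation score as one hyperparameter varies
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(), X, y,
    param_name='max_depth', param_range=range(1, 8), cv=5)

# learning curve: train/validation score as the train set grows
train_sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
```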
Preliminary:
361
one_md_per_day_format/piscine/Week2/day1.md
@ -1,13 +1,12 @@
# W2D01 Piscine AI - Data Science

The goal of this day is to understand practical Linear regression and supervised learning.

Author:

# Table of Contents

Historical part:
# Introduction
@ -16,30 +15,33 @@ studied the size of individuals within a progeny. He was trying to understand wh
large individuals in a population appeared to have smaller children, closer
to the average population size; hence the introduction of the term "regression".

Today we will learn a basic algorithm used in **supervised learning**: **the Linear Regression**. We will be using **Scikit-learn**, which is a machine learning library. It is designed to interoperate with the Python libraries NumPy and Pandas.

We will also learn progressively the Machine Learning methodology for supervised learning - today we will focus on evaluating a machine learning model by splitting the data set in a train set and a test set.
'0.22.1'
## Rules
## Resources

### To start with Scikit-learn
- https://scikit-learn.org/stable/tutorial/basic/tutorial.html
- https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html
- https://scikit-learn.org/stable/modules/linear_model.html

### Machine learning methodology and algorithms

- This course provides a broad introduction to machine learning, datamining, and statistical pattern recognition. Andrew Ng is a star in the Machine Learning community. I recommend to spend some time during the projects to focus on some algorithms. However, Python is not the language used for the course. https://www.coursera.org/learn/machine-learning
- https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet
- https://scikit-learn.org/stable/tutorial/index.html

### Linear Regression
- https://towardsdatascience.com/laymans-introduction-to-linear-regression-8b334a3dab09
@ -48,78 +50,76 @@ https://scikit-learn.org/stable/tutorial/index.html
### Train test split
- https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
- https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture?hl=en

# Exercise 1 Scikit-learn estimator

The goal of this exercise is to learn to fit a Scikit-learn estimator and use it to predict.

```python
X, y = [[1],[2.1],[3]], [[1],[2],[3]]
```

1. Fit a LinearRegression from Scikit-learn with X the features and y the target.
2. Predict for `x_pred = [[4]]`
3. Print the coefficients (`coef_`) and the intercept (`intercept_`), and the score (`score`) of the regression of X and y.
## Correction
1. This question is validated if the output of the fitted model is:

```python
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
```
2. This question is validated if the output is:

```python
array([[3.96013289]])
```
3. This question is validated if the output is:

```console
Coefficients: [[0.99667774]]
Intercept: [-0.02657807]
Score: 0.9966777408637874
```
# Exercise 2 Linear regression in 1D

The goal of this exercise is to understand how the linear regression works in one dimension. To do so, we will generate data in one dimension. Using `make_regression` from Scikit-learn, generate a data set with 100 observations:
```python
from sklearn.datasets import make_regression

X, y, coef = make_regression(n_samples=100,
                             n_features=1,
                             n_informative=1,
                             noise=10,
                             coef=True,
                             random_state=0,
                             bias=100.0)
```
1. Plot the data using matplotlib. The plot should look like this:
![alt text][q1]
[q1]: images/day1/ex2/w2_day1_ex2_q1.png "Scatter plot"
2. Fit a LinearRegression from Scikit-learn on the generated data and give the equation of the fitted line. The expected output is: `y = coef * x + intercept`

3. Add the fitted line to the plot. The plot should look like this:
![alt text][q3]
[q3]: images/day1/ex2/w2_day1_ex2_q3.png "Scatter plot + fitted line"
4. Predict on X
5. Create a function that computes the Mean Squared Error (MSE) and compute the MSE on the data set. *The MSE is frequently used as well as other regression metrics that will be studied later this week.*
```python
def compute_mse(y_true, y_pred):
@ -129,23 +129,21 @@ The goal of this exercice is to understand how the linear regression works in on
Change the `noise` parameter of `make_regression` to 50
6. Repeat questions 2 and 4 and compute the MSE on the new data.
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
## Correction
1. This question is validated if the plot looks like:
![alt text][q1]
[q1]: images/day1/ex2/w2_day1_ex2_q1.png "Scatter plot"
2. This question is validated if the equation of the fitted line is: `y = 42.619430291366946 * x + 99.18581817296929`

3. This question is validated if the plot looks like:
![alt text][q3]
@ -153,38 +151,39 @@ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_e
4. This question is validated if the outputted predictions for the first 10 values are:
```python
array([ 83.86186727, 140.80961751, 116.3333897 , 64.52998689,
61.34889539, 118.10301628, 57.5347917 , 117.44107847,
108.06237908, 85.90762675])
```
5. This question is validated if the MSE returned is `114.17148616819485`
6. This question is validated if the MSE returned is `2854.2871542048706`
# Exercise 3: Train test split

The goal of this exercise is to learn to split a data set. It is important to understand why we split the data in two sets. In a nutshell: the Machine Learning model learns on the training data and is evaluated on data it hasn't seen before: the testing data.
This video gives a basic and nice explanation: https://www.youtube.com/watch?v=_vdMKioCXqQ
This article explains the conditions to split the data and how to split it: https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
```python
import numpy as np

X = np.arange(1,21).reshape(10,-1)
y = np.arange(1,11)
```

1. Split the data using `train_test_split` with `shuffle=False`. The test set represents 20% of the total size of the data set. Print X_train, y_train, X_test, y_test.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
## Correction
1. This question is validated if X_train, y_train, X_test, y_test match this output:
```console
X_train:
[[ 1 2]
[ 3 4]
[ 5 6]
@ -195,52 +194,50 @@ X_train:
[15 16]]
y_train:
[1 2 3 4 5 6 7 8]
X_test:
[[17 18]
[19 20]]
y_test:
[ 9 10]
```
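For reference, a sketch of the expected call (20% test set, no shuffling):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)
```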
# Exercise 4 Forecast diabetes progression

The goal of this exercise is to use Linear Regression to forecast the progression of diabetes. It will not always be stated explicitly: you should **ALWAYS** start with an exploratory data analysis in order to have a good understanding of the data you model. As a reminder, here is an introduction to EDA:

- https://towardsdatascience.com/exploratory-data-analysis-eda-a-practical-guide-and-template-for-structured-data-abfbf3ee3bd9
The data set used is described in https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.
```python
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
```
1. Using `train_test_split`, split the data set in a train set and a test set (20%). Use `random_state=43` for results reproducibility.

2. Fit the Linear Regression on all the variables. Give the coefficients and the intercept of the Linear Regression. What is then the equation?

3. Predict on the test set. Predicting on the test set is like having new patients for whom, as a physician, you need to forecast the disease progression in one year given the 10 baseline variables.

4. Compute the MSE on the train set and test set. Later this week we will learn about the R2, which will help us to evaluate the performance of this fitted Linear Regression. The MSE returns an arbitrary value depending on the range of error.

**WARNING**: This will be explained later this week. But here, we are doing something "dangerous". As you may have read in the data documentation, the data is scaled using the whole dataset whereas we should first scale the data on the training set and then use this scaling on the test set. This is a toy example, so let's ignore this detail for now.
https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
## Correction

1. This question is validated if the output of `y_train.values[:10]` and `y_test.values[:10]` are:

```console
y_train.values[:10]:
[[202.]
[ 55.]
@ -264,11 +261,11 @@ https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
[ 78.]
[ 66.]
[192.]]
```
2. This question is validated if the coefficients and the intercept are:
```console
[('age', -60.40163046086952),
('sex', -226.08740652083418),
('bmi', 529.383623302316),
@ -282,9 +279,9 @@ https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
('intercept', 152.05314895029233)]
```
3. This question is validated if the output of `predictions_on_test[:10]` is:

```console
array([[111.74351759],
[ 98.41335251],
[168.36373195],
@ -295,141 +292,135 @@ https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset
[126.28961941],
[117.73121787],
[224.83346984]])
```
4. This question is validated if the mse on the **train set** is `2888.326888` and the mse on the **test set** is `2858.255153`.

## Exercise 5 Gradient Descent - Optional

The goal of this exercise is to understand how the Linear Regression algorithm finds the optimal coefficients.

The goal is to fit a Linear Regression on one dimensional features data **without using Scikit-learn**. Let's use the data set we generated for exercise 2:
```python
X, y, coef = make_regression(n_samples=100,
                             n_features=1,
                             n_informative=1,
                             noise=10,
                             coef=True,
                             random_state=0,
                             bias=100.0)
```
*Warning: The shape of X is not the same as the shape of y. You may need (for some questions) to reshape X using: `X.reshape(1,-1)[0]`.*
1. Plot the data using matplotlib:
![alt text][ex5q1]
[ex5q1]: images/day1/ex5/w2_day1_ex5_q1.png "Scatter plot "
As a reminder, fitting a Linear Regression on this data means finding (a, b) that fits the data points well:

- `y_pred = a*x + b`
Mathematically, it means finding (a,b) that minimizes the MSE, which is the loss used in Linear Regression. If we consider 3 data points:
- `Loss(a,b) = MSE(a,b) = 1/3 * ((y_pred1 - y_true1)**2 + (y_pred2 - y_true2)**2 + (y_pred3 - y_true3)**2)`
and we know:

y_pred1 = a*x1 + b\
y_pred2 = a*x2 + b\
y_pred3 = a*x3 + b
### Greedy approach
2. Create a function `compute_mse`. Compute mse for `a = 1` and `b = 2`.
**Warning**: `X.shape` is `(100, 1)` and `y.shape` is `(100, )`. Make sure that `y_preds` and `y` have the same shape before computing `y_preds - y`.
```python
def compute_mse(coefs, X, y):
    '''
    coefs is a list that contains a and b: [a,b]
    X is the features set
    y is the target

    Returns a float which is the MSE
    '''
    #TODO
    y_preds =
    mse =
    return mse
```
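A possible completion of the skeleton above (one sketch; the reshape follows the warning earlier in this exercise):

```python
import numpy as np

def compute_mse(coefs, X, y):
    '''
    coefs is a list that contains a and b: [a,b]
    X is the features set
    y is the target

    Returns a float which is the MSE
    '''
    a, b = coefs
    y_preds = a * X.reshape(1, -1)[0] + b      # same shape as y: (100,)
    mse = float(np.mean((y_preds - y) ** 2))
    return mse

print(compute_mse([1, 2], X, y))  # question 2: MSE for a=1, b=2
```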
3. Create a grid of **640000** points that combines a and b. Check that the grid contains 640000 points.

- a between -200 and 200, step = 0.5
- b between -200 and 200, step = 0.5

This is how to compute the grid with the combination of a and b:

```python
aa, bb = np.mgrid[-200:200:0.5, -200:200:0.5]
grid = np.c_[aa.ravel(), bb.ravel()]
```

4. Compute the MSE for all points in the grid. If possible, parallelize the computations. It may be needed to use `functools.partial` to parallelize a function with many parameters on a list. Put the result in a variable named `losses`.
5. Use this chunk of code to plot the MSE in 2D:
```python
import numpy as np
import matplotlib.pyplot as plt

aa, bb = np.mgrid[-200:200:.5, -200:200:.5]
grid = np.c_[aa.ravel(), bb.ravel()]
losses_reshaped = np.array(losses).reshape(aa.shape)

f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(aa,
                      bb,
                      losses_reshaped,
                      100,
                      cmap="RdBu",
                      vmin=0,
                      vmax=160000)
ax_c = f.colorbar(contour)
ax_c.set_label("MSE")
ax.set(aspect="equal",
       xlim=(-200, 200),
       ylim=(-200, 200),
       xlabel="$a$",
       ylabel="$b$")
```

The expected output is:

![alt text][ex5q5]

[ex5q5]: images/day1/ex5/w2_day1_ex5_q5.png "MSE"

6. From the `losses` list, find the optimal value of a and b and plot the line in the scatter plot of question 1.
In this example we computed the MSE 640,000 times. It is frequent to deal with 50 features, which requires 51 parameters to fit the Linear Regression. If we tried this approach with 50 features and the same 800 candidate values per parameter, we would need to compute the MSE about **1.1e+148** times. Even if we reduce the scope and try only 5 values per coefficient, we would still have to compute the MSE **4.4409e+35** times. This approach is not scalable, and that is why it is not used to find optimal coefficients for Linear Regression.

### Gradient Descent

In a nutshell, gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters (a and b) of our model. Parameters refer to the coefficients used in Linear Regression. Before starting on the questions, take the time to read the article https://jairiidriss.medium.com/gradient-descent-algorithm-from-scratch-using-python-2b36c1548917. It explains gradient descent and how to implement it. The "tricky" part is the computation of the derivative of the MSE. You may take the formulas of the derivatives as given to implement the gradient descent (`d_theta_0` and `d_theta_1` in the article).
7. Implement the gradient descent to find optimal a and b with `learning rate = 0.1` and `nbr_iterations=100`.
8. Save the a and b through the iterations in a two dimensional numpy array. Add them to the plot of the previous part and observe a and b converge towards the minimum. The plot should look like this:
![alt text][ex5q8]
[ex5q8]: images/day1/ex5/w2_day1_ex5_q8.png "MSE + Gradient descent"
9. Use Linear Regression from Scikit-learn. Compare the results.
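A hedged sketch of questions 7 and 8 (the starting point is an assumption; the derivative formulas follow the article's `d_theta_0`/`d_theta_1`):

```python
import numpy as np

x = X.reshape(1, -1)[0]          # flatten X as the warning above suggests
a, b = 0.0, 0.0                  # assumed starting point
learning_rate, nbr_iterations = 0.1, 100

history = np.zeros((nbr_iterations, 2))  # a and b through the iterations
n = len(x)
for i in range(nbr_iterations):
    y_preds = a * x + b
    d_a = (2 / n) * np.sum((y_preds - y) * x)  # dMSE/da
    d_b = (2 / n) * np.sum(y_preds - y)        # dMSE/db
    a -= learning_rate * d_a
    b -= learning_rate * d_b
    history[i] = [a, b]

print(a, b)  # should approach the optimum found on the grid (~42.6, ~99.2)
```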
## Correction
1. This question is validated if the outputted plot looks like:
@ -441,13 +432,13 @@ In a nutshel, Gradient descent is an optimization algorithm used to minimize som
3. This question is validated if `grid.shape` is `(640000,2)`.
4. This question is validated if the first 10 values of losses are:
```console
array([158315.41493175, 158001.96852692, 157689.02212209, 157376.57571726,
       157064.62931244, 156753.18290761, 156442.23650278, 156131.79009795,
       155821.84369312, 155512.39728829])
```
5. This question is validated if the outputted plot looks like
@ -456,14 +447,14 @@ In a nutshel, Gradient descent is an optimization algorithm used to minimize som
[ex5q5]: images/day1/ex5/w2_day1_ex5_q5.png "MSE"
6. This question is validated if the point returned is
`array([42.5, 99. ])`. It means that `a = 42.5` and `b = 99`.
7. This question is validated if the coefficients returned are
```console
Coefficients (a): 42.61943031121358
Intercept (b): 99.18581814447936
```
8. This question is validated if the outputted plot is
@ -471,11 +462,9 @@ In a nutshel, Gradient descent is an optimization algorithm used to minimize som
[ex5q8]: images/day1/ex5/w2_day1_ex5_q8.png "MSE + Gradient descent"
9. This question is validated if the coefficients and intercept returned are:
```console
Coefficients: [42.61943029]
Intercept: 99.18581817296929
```
28
one_md_per_day_format/piscine/Week2/day2.md
@ -31,8 +31,8 @@ More details:
https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102
For the linear regression exercises, the loss (Mean Square Error - MSE) is minimized with an algorithm called **gradient descent**. In classification, the MSE loss can't be used because the output of the model is 0 or 1 (for binary classification).

The **logloss** or **cross entropy** is the loss used for classification. Similarly, it has some nice mathematical properties. The minimization of the **logloss** is not covered in the exercises. However, since it is used in most machine learning models for classification, I recommend spending some time reading the related article. This article gives a nice example of how it works:
https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
@ -48,9 +48,9 @@ https://medium.com/swlh/what-is-logistic-regression-62807de62efa
# Exercise 1 Logistic regression in Scikit-learn

The goal of this exercise is to learn to use Scikit-learn to classify data.

```python
X = [[0],[0.1],[0.2], [1],[1.1],[1.2], [1.3]]
y = [0,0,0,1,1,1,0]
@ -93,9 +93,9 @@ Score:
```
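A minimal sketch of the expected usage (the questions are truncated in the diff; `predict`, `predict_proba` and `score` match the "Score:" expected output above):

```python
from sklearn.linear_model import LogisticRegression

X = [[0], [0.1], [0.2], [1], [1.1], [1.2], [1.3]]
y = [0, 0, 0, 1, 1, 1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))        # predicted classes
print(clf.predict_proba(X))  # probability of each class
print(clf.score(X, y))       # accuracy
```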
# Exercise 2 Sigmoid

The goal of this exercise is to learn to compute and plot the sigmoid function.
1. On the same plot, plot the sigmoid function and the custom sigmoids defined as:
```python
@ -121,9 +121,9 @@ The plot should look like this:
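A hedged sketch of the plot (the custom sigmoids are truncated above, so the variants here are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x, k=1.0, x0=0.0):
    # standard sigmoid for k=1, x0=0; k and x0 define "custom" variants
    return 1 / (1 + np.exp(-k * (x - x0)))

x = np.linspace(-10, 10, 200)
plt.plot(x, sigmoid(x), label='sigmoid')
plt.plot(x, sigmoid(x, k=2), label='custom sigmoid k=2')
plt.plot(x, sigmoid(x, x0=2), label='custom sigmoid x0=2')
plt.legend()
plt.title('Sigmoid functions')
plt.show()
```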
# Exercise 3 Decision boundary

The goal of this exercise is to learn to fit a logistic regression on simple examples and to understand how the algorithm separates the data from the different classes.
## 1 dimension
@ -304,9 +304,9 @@ As mentioned, it is not required to shift the class prediction to make the plot
# Exercise 4: Train test split

The goal of this exercise is to learn to split a classification data set. The idea is the same as splitting a regression data set, but there's one important detail specific to classification: the proportion of each class in the train set and test set.
@ -358,9 +358,9 @@ The proportion of class `1` is **0.125** in the train set and **1.** in the test
2. This question is validated if the proportion of class `1` is **0.3** for both sets.
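The key parameter is `stratify`; a minimal sketch (X and y are the classification data set of the exercise):

```python
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```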
# Exercise 5 Breast Cancer prediction

The goal of this exercise is to use Logistic Regression
to predict breast cancer. It is always important to understand the data before training any Machine Learning algorithm. The data is described in **breast-cancer-wisconsin.names**. I suggest adding the column names manually to the DataFrame.
Preliminary:
@ -439,9 +439,9 @@ array([[90, 2],
As said, for some reason, you may have slightly different results because of the data splitting. However, the values you have in the confusion matrix should be close to these results.
# Exercise 6 Multi-class (Optional)

The goal of this exercise is to learn to train a classification algorithm on multi-class labelled data.

Some algorithms such as SVM or Logistic Regression do not natively support multi-class (more than 2 classes). There are approaches that allow these algorithms to be used on multi-class data.
Let's assume we work with 3 classes: A, B and C.
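One classic approach is one-vs-rest: train one binary classifier per class (A vs not A, B vs not B, C vs not C) and predict the class whose classifier is the most confident. A hedged sketch with scikit-learn's wrapper:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# X, y assumed: features and labels taking the values 'A', 'B' or 'C'
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```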
28
one_md_per_day_format/piscine/Week2/day4.md
@ -36,9 +36,9 @@ https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-model
https://scikit-learn.org/stable/modules/model_evaluation.html
# Exercise 1 MSE Scikit-learn

The goal of this exercise is to learn to use `sklearn.metrics` to compute the mean squared error (MSE).
1. Compute the MSE using `sklearn.metrics` on `y_true` and `y_pred` below:
@ -51,10 +51,10 @@ y_pred = [90, 48, 2, 2, -4]
1. This question is validated if the MSE outputted is **2.25**.
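A sketch of the call (`y_true` is truncated in the diff above; the placeholder values below were chosen so the result matches the expected 2.25):

```python
from sklearn.metrics import mean_squared_error

y_true = [91, 51, 2.5, 2, -3]  # placeholders, not the exercise's values
y_pred = [90, 48, 2, 2, -4]

print(mean_squared_error(y_true, y_pred))  # 2.25
```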
# Exercise 2 Accuracy Scikit-learn

The goal of this exercise is to learn to use `sklearn.metrics` to compute the accuracy.
1. Compute the accuracy using `sklearn.metrics` on `y_true` and `y_pred` below:
@ -68,9 +68,9 @@ y_true = [0, 0, 1, 1, 1, 1, 0]
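A sketch of the call (`y_pred` is truncated in the diff above, so its values here are placeholders):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0]  # placeholder predictions

print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
```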
# Exercise 3 Regression

The goal of this exercise is to learn to evaluate a machine learning model using many regression metrics.
Preliminary:
@ -138,13 +138,13 @@ pipe.fit(X_train, y_train)
MSE on the test set: 0.5537420654727396
```
This result shows that the model has slightly better results on the train set than on the test set. That's frequent, since it is easier to get a better grade on an exam we studied for than on an exam that differs from what we prepared. However, the results are not good: R2 ~ 0.3. Fitting non-linear models such as the Random Forest on this data may improve the results. That's the goal of exercise 5.
# Exercise 4 Classification

The goal of this exercise is to learn to evaluate a machine learning model using many classification metrics.
Preliminary:
@ -232,9 +232,9 @@ Having a 99% ROC AUC is not usual. The data set we used is easy to classify. On
# Exercise 5 Machine Learning models

The goal of this exercise is to have an overview of the existing Machine Learning models and to learn to call them from scikit-learn.
We will focus on:
- SVM/ SVC
@ -363,9 +363,9 @@ Take time to have basic understanding of the role of the basic hyperparameters a
It is important to notice that the Decision Tree overfits very easily. It learns the training data easily but is not able to extrapolate on the test set. This algorithm is not used a lot on its own.

However, Random Forest and Gradient Boosting propose a solid approach to correct the overfitting (in that case the parameter `max_depth` is set to None, which is why the Random Forest overfits the data). These two algorithms are used intensively in Machine Learning projects.

# Exercise 6 Grid Search

The goal of this exercise is to learn how to make an exhaustive search over specified parameter values for an estimator. This is very useful because the hyperparameters, which are the parameters of the model, impact the performance of the model.

The scikit-learn object that runs the Grid Search is called GridSearchCV. We will learn about cross validation tomorrow. For now, let us set the parameter **cv** to `[(np.arange(18576), np.arange(18576,20640))]`.
This means that GridSearchCV splits the data set in a train and test set.
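A hedged sketch of that setup (the estimator and the parameter grid are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# a single train/test split expressed as one "fold": the first 18576 rows
# train, the remaining rows test
cv = [(np.arange(18576), np.arange(18576, 20640))]

gs = GridSearchCV(RandomForestRegressor(),
                  param_grid={'max_depth': [5, 10, None]},
                  cv=cv)
```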
@ -450,7 +450,7 @@ Ressources:
return gs.best_estimator_, gs.best_params_, gs.best_score_
```
In my case, the grid search parameters are not interesting. Even though I reduced the overfitting of the Random Forest, the score on the test set is lower than the score on the test set returned by the Gradient Boosting in the previous exercise without optimal parameter search.
3. This question is validated if the code used is:
10
one_md_per_day_format/piscine/Week2/template.md
@ -16,21 +16,21 @@
## Resources

# Exercise 1

# Exercise 2

# Exercise 3

# Exercise 4

# Exercise 5
10
one_md_per_day_format/piscine/Week3/template.md
@ -16,21 +16,21 @@
## Resources

# Exercise 1

# Exercise 2

# Exercise 3

# Exercise 4

# Exercise 5
18
one_md_per_day_format/piscine/Week3/w3day02.md
@ -10,7 +10,7 @@ The goal of this day is to learn to use Keras to build Neural Networks.
There are two ways to build Keras models: sequential and functional.
The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercises focus on the usage of the sequential API.
'2.4.3'
@ -25,9 +25,9 @@ A developper
## Resources
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
# Exercise 1 Sequential

The goal of this exercise is to learn to call the object `Sequential`.
1. Put the object Sequential in a variable named `model` and print the variable `model`.
@ -39,9 +39,9 @@ The goal of this exercice is to learn to call the object `Sequential`.
# Exercise 2 Dense

The goal of this exercise is to learn to create layers of neurons. Keras proposes options to create custom layers. The neural networks built in these exercises do not require custom layers; `Dense` layers do the job. A dense layer is simply a layer where each unit or neuron is connected to each neuron in the next layer. As seen yesterday, there are three main types of layers: input, hidden and output. The **input layer** that specifies the number of inputs (features) is not represented as a layer in Keras. However, `Dense` has a parameter `input_dim` that gives the number of inputs in the previous layer. The output layer, as any hidden layer, can be created using `Dense`; the only difference is that the output layer contains one single neuron.
1. Create a `Dense` layer with these parameters and return the output of `get_config`:
@ -121,9 +121,9 @@ The goal of this exercice is to learn to create layers of neurons. Keras propose
'bias_constraint': None}
```
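A hedged example of creating a `Dense` layer and inspecting it (the exact parameters of the question are truncated above, so the values here are assumptions):

```python
from tensorflow.keras.layers import Dense

# 8 neurons, 5 input features, sigmoid activation (assumed values)
layer = Dense(8, input_dim=5, activation='sigmoid')
print(layer.get_config())
```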
# Exercise 3 Architecture

The goal of this exercise is to combine the layers and to create a neural network.
1. Create a neural network for regression with the following architecture and return `print(model.summary())`:
@ -145,9 +145,9 @@ The goal of this exercice is to combine the layers and to create a neural networ
```
The first two layers could use another activation function than sigmoid (e.g. relu).
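A hedged sketch of such a regression network (layer sizes are assumptions, since the architecture table is truncated above):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(8, input_dim=5, activation='sigmoid'))  # hidden layer 1
model.add(Dense(4, activation='sigmoid'))               # hidden layer 2
model.add(Dense(1))                                     # linear output neuron
print(model.summary())
```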
# Exercise 4 Optimize

The goal of this exercise is to learn to train the neural network. Once the architecture of the neural network is set, there are two steps to train it:

- `compile`: The compilation step aims to set the loss function, to choose the algorithm that minimizes the chosen loss function, and to choose the metric the model outputs.
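For a regression network, the compile step could look like this (the loss, optimizer and metric choices are assumptions):

```python
model.compile(loss='mse',        # the loss to minimize
              optimizer='adam',  # the algorithm minimizing the loss
              metrics=['mae'])   # the metric the model reports
```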
22
one_md_per_day_format/piscine/Week3/w3day03.md
@ -24,9 +24,9 @@ A developper
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
# Exercise 1 Regression - Optimize

The goal of this exercise is to learn to set up the optimization for a regression neural network. There's no code to run in this exercise. In W2D2E3, we implemented a neural network designed for regression. We will be using this neural network:

```python
model = keras.Sequential()
@ -68,9 +68,9 @@ https://keras.io/api/losses/regression_losses/
https://keras.io/api/metrics/regression_metrics/
# Exercise 2 Regression example

The goal of this exercise is to learn to train a neural network to perform a regression on a data set.

The data set is the Auto MPG Dataset and the goal is to build a model to predict the fuel efficiency of late-1970s and early 1980s automobiles. To do this, provide the model with a description of many automobiles from that time period. This description includes attributes like: cylinders, displacement, horsepower, and weight.
https://www.tensorflow.org/tutorials/keras/regression
@ -150,9 +150,9 @@ The output neuron has to be `Dense(1)` - by defaut the activation funtion is lin
*Hint*: To get the score on the test set, `evaluate` could have been used: `model.evaluate(X_test_scaled, y_test)`.
# Exercise 3 Multi classification - Softmax

The goal of this exercise is to learn a neural network architecture for multi-class data. This is an important type of problem on which to practice with neural networks because the three class values require specialized handling. A multi-classification neural network uses as output layer a **softmax** layer. The **softmax** activation function is an extension of the sigmoid, as it is designed to output the probabilities to belong to each class in a multi-class problem. This output layer has to contain as many neurons as classes in the multi-classification problem. This article explains in detail how it works: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax
Let us assume we want to classify images and we know they contain either apples, bears, candies, eggs or dogs (extension of the example in the link above).
@ -175,9 +175,9 @@ Let us assume we want to classify images and we know they contain either apples,
model.add(Dense(5, activation= 'softmax'))
```
# Exercise 4 Multi classification - Optimize

The goal of this exercise is to learn to optimize a multi-classification neural network. As learnt previously, the loss function used in binary classification is the log loss - also called in Keras `binary_crossentropy`. This function is defined for binary classification and can be extended to multi-classification. In Keras, the extended loss that supports multi-classification is `categorical_crossentropy`. There's no code to run in this exercise.
1. Fill in the chunk of code below in order to optimize the neural network defined in the previous exercise. Choose the appropriate loss, `adam` as the optimizer and accuracy as the metric.
@@ -196,9 +196,9 @@ model.compile(loss='categorical_crossentropy',
metrics=['accuracy'])
```
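One detail worth noting: `categorical_crossentropy` expects one-hot encoded labels. A minimal sketch of producing them with Keras (the integer labels here are made up for illustration):

```
from tensorflow.keras.utils import to_categorical

# three hypothetical samples labeled with classes 0, 3 and 4 (out of 5)
y_one_hot = to_categorical([0, 3, 4], num_classes=5)  # shape (3, 5)
```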
# Exercice 5 Multi classification example
# Exercise 5 Multi classification example
The goal of this exercice is to learn to use a neural network to classify a multiclass data set. The data set used is the Iris data set which allows to classify flower given basic features as flower's measurement.
The goal of this exercise is to learn to use a neural network to classify a multiclass data set. The data set used is the Iris data set, which allows you to classify a flower given basic features such as its measurements.
Preliminary:
- Split the data into train and test sets. Keep 20% for the test set. Use `random_state=1` (see the sketch below).
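A sketch of this preliminary step, reusing the `X_train_sc` and `y_train_multi_class` names that appear in the fit call below; the choice of `StandardScaler` is an assumption:

```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.utils import to_categorical

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)  # fit the scaler on train data only
X_test_sc = scaler.transform(X_test)

y_train_multi_class = to_categorical(y_train, num_classes=3)  # one-hot labels
```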
@@ -245,6 +245,6 @@ model.fit(X_train_sc, y_train_multi_class, epochs=1000, batch_size=20)
# Exercice 6 GridSearch
# Exercise 6 GridSearch
https://medium.com/@am.benatmane/keras-hyperparameter-tuning-using-sklearn-pipelines-grid-search-with-cross-validation-ccfc74b0ce9f
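Following the approach of the article, a hedged sketch of wrapping a Keras model so scikit-learn's `GridSearchCV` can tune it. The wrapper import path matches the Keras versions current at the time; newer setups use `scikeras.wrappers.KerasClassifier` instead. `X_train_sc` and `y_train_multi_class` come from the previous exercise:

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model():
    # same 4-feature / 3-class architecture as the Iris exercise
    model = Sequential()
    model.add(Dense(8, activation='sigmoid', input_shape=(4,)))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    return model

clf = KerasClassifier(build_fn=build_model, verbose=0)
param_grid = {'epochs': [50, 100], 'batch_size': [10, 20]}
grid = GridSearchCV(clf, param_grid, cv=3)
grid.fit(X_train_sc, y_train_multi_class)
print(grid.best_params_, grid.best_score_)
```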
40
one_md_per_day_format/piscine/Week3/w3day04.md
@@ -29,12 +29,12 @@ The NLTK and Spacy packages to do the preprocessing
## Resources
# Exercice 1: Lowercase
# Exercise 1: Lowercase
The goal of this exercice is to learn to lowercase text data in Python. Note that if the volume of data is low the text data can be stored in a Pandas DataFrame or Series. But, when dealing with high volumes (high but not huge), using a Pandas DataFrame or Series is not efficient. Data structures as dictionaries or list are more adapted.
The goal of this exercise is to learn to lowercase text data in Python. Note that if the volume of data is low, the text data can be stored in a Pandas DataFrame or Series. But, when dealing with high volumes (high but not huge), using a Pandas DataFrame or Series is not efficient. Data structures such as dictionaries or lists are better adapted.
```
list_ = ["This is my first NLP exercice", "wtf!!!!!"]
list_ = ["This is my first NLP exercise", "wtf!!!!!"]
series_data = pd.Series(list_, name='text')
```
@@ -46,21 +46,21 @@ Note: Do not change the text manually!
1. This question is validated if the output is:
```
0 this is my first nlp exercice
0 this is my first nlp exercise
1 wtf!!!!!
Name: text, dtype: object
```
2. This question is validated if the output is:
```
0 THIS IS MY FIRST NLP EXERCICE
0 THIS IS MY FIRST NLP EXERCISE
1 WTF!!!!!
Name: text, dtype: object
```
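One way to produce these two outputs is with Pandas' vectorized string methods (a minimal sketch):

```
import pandas as pd

list_ = ["This is my first NLP exercise", "wtf!!!!!"]
series_data = pd.Series(list_, name='text')

print(series_data.str.lower())  # first expected output
print(series_data.str.upper())  # second expected output
```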
# Exercise 2: Punctuation
The goal of this exerice is to learn to deal with punctuation. In Natural Language Processing, some basic approaches as Bag of Words (exercice X) model the text as an unordered combination of words. In that case the punctuation is not always useful as it doesn't add information to the model. That is why is removed.
The goal of this exercise is to learn to deal with punctuation. In Natural Language Processing, some basic approaches such as Bag of Words (exercise X) model the text as an unordered combination of words. In that case the punctuation is not always useful, as it doesn't add information to the model. That is why it is removed.
1. Remove the punctuation from this sentence. All characters in !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ are considered as punctuation.
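A minimal sketch of this step using `string.punctuation`, which contains exactly the characters listed above; the example sentence is a made-up stand-in:

```
import string

sentence = "Hello, world!!! (This; is: punctuation-heavy?)"  # hypothetical input
cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
print(cleaned)  # 'Hello world This is punctuationheavy'
```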
@@ -81,9 +81,9 @@ The goal of this exercise is to learn to deal with punctuation. In Natural Langua
```
# Exercice 3 Tokenization
# Exercise 3 Tokenization
The goal of this exercice is to learn to tokenize as text. This step is important because it splits the text into token. A token could be a sentence or a word.
The goal of this exercise is to learn to tokenize a text. This step is important because it splits the text into tokens. A token could be a sentence or a word.
```
text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software."""
@@ -152,13 +152,13 @@ https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-p
```
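A sketch of both granularities with NLTK; the `punkt` tokenizer models need to be downloaded once:

```
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # one-time download of the tokenizer models

text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software."""

print(sent_tokenize(text))  # tokens are sentences
print(word_tokenize(text))  # tokens are words
```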
# Exercice 4 Stop words
# Exercise 4 Stop words
The goal of this exercice is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language. For example: "and", "is", "a" are stop words and do not add information to a sentence.
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refer to the most common words in a language. For example: "and", "is", "a" are stop words and do not add information to a sentence.
```
text = """
The goal of this exercice is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
"""
```
1. Remove stop words from this sentence and return the list of word tokens without stop words.
@@ -168,13 +168,13 @@ The goal of this exercise is to learn to remove stop words with NLTK. Stop word
1. This question is validated if, using NLTK, the output is:
```
['The', 'goal', 'exercice', 'learn', 'remove', 'stop', 'words', 'NLTK', '.', 'Stop', 'words', 'usually', 'refers', 'common', 'words', 'language', '.']
['The', 'goal', 'exercise', 'learn', 'remove', 'stop', 'words', 'NLTK', '.', 'Stop', 'words', 'usually', 'refers', 'common', 'words', 'language', '.']
```
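A sketch that reproduces this kind of output. Note that the membership test below is case-sensitive, which is why the capitalized "The" survives while the lowercase stop words are dropped:

```
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

text = """
The goal of this exercise is to learn to remove stop words with NLTK. Stop words usually refers to the most common words in a language.
"""
stop_words = set(stopwords.words('english'))  # lowercase English stop words
tokens = [w for w in word_tokenize(text) if w not in stop_words]
print(tokens)
```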
# Exercice 5 Stemming
# Exercise 5 Stemming
The goal of this exercice is to learn to use stemming using NLTK. As explained in details in the article, stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language.
The goal of this exercise is to learn to use stemming with NLTK. As explained in detail in the article, stemming is the process of reducing inflection in words to their root forms, such as mapping a group of words to the same stem even if the stem itself is not a valid word in the language.
Note: The output of a stemmer is a word that may not exist in the dictionary.
@@ -196,9 +196,9 @@ The interviewer interviews the president in an interview
```
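A sketch with NLTK's `PorterStemmer` on the sentence above:

```
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
stemmer = PorterStemmer()
text = "The interviewer interviews the president in an interview"
print([stemmer.stem(w) for w in word_tokenize(text)])
# the 'interview*' variants collapse toward a common stem
```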
# Exercice 6: Text preprocessing
# Exercise 6: Text preprocessing
The goal of this exercice is to learn to create a function to prepocess and clean a text using NLTK.
The goal of this exercise is to learn to create a function to preprocess and clean a text using NLTK.
Put this text in a variable:
@@ -267,22 +267,22 @@ https://towardsdatascience.com/nlp-preprocessing-with-nltk-3c04ee00edc0
```
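Putting the previous exercises together, a hedged sketch of such a cleaning function; the exact steps and their order may differ from the expected answer:

```
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, stem."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stop_words]
```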
# Exercice 7: Bag of Word representation
# Exercise 7: Bag of Word representation
https://machinelearningmastery.com/gentle-introduction-bag-words-model/
The goal of this exercice is to understand how to create a Bag of Word (BoW) model on a corpus of texts. More precesily we will create a labeled data set from textual data using a word count matrix.
The goal of this exercise is to understand how to create a Bag of Words (BoW) model on a corpus of texts. More precisely, we will create a labeled data set from textual data using a word count matrix.
As explained in the resource, the Bag of Words representation makes the assumption that the order in which the words appear in a text doesn't matter. There are different types of Bag of Words representations:
- Boolean: Each document is a boolean vector
- Wordcount: Each document is a word count vector
- TFIDF: Each document is a score vector. The score is detailed in the next exercice.
- TFIDF: Each document is a score vector. The score is detailed in the next exercise.
The file `tweets_train.txt` contains tweets labeled with a sentiment. It gives the positivity of a tweet.
Steps:
1. Preprocess the data using the function implemented in the previous exercice. And, using from `CountVectorizer` of scikitlearn with `max_features=500` compute the wordcount of the tweets. The output is a sparse matrix.
1. Preprocess the data using the function implemented in the previous exercise. Then, using `CountVectorizer` from scikit-learn with `max_features=500`, compute the word count of the tweets. The output is a sparse matrix (see the sketch below).
- Check the shape of the word count matrix
- Set **max_features** to 500 to keep only the 500 most frequent words of the initial dictionary.
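A sketch of this step, assuming `tweets` is a list of raw tweet strings and `preprocess` is the function from the previous exercise. The tokens are joined back into strings because `CountVectorizer` expects raw documents:

```
from sklearn.feature_extraction.text import CountVectorizer

corpus = [' '.join(preprocess(t)) for t in tweets]  # tweets: assumed list of strings

vectorizer = CountVectorizer(max_features=500)   # keep the 500 most frequent words
count_matrix = vectorizer.fit_transform(corpus)  # sparse word count matrix
print(count_matrix.shape)                        # (number_of_tweets, 500)
```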
24
one_md_per_day_format/piscine/Week3/w3day05.md
@@ -19,9 +19,9 @@ There are many types of language models pre-trained in Spacy. Each has its specif
## Resources
# Exercice 1 Embedding 1
# Exercise 1 Embedding 1
The goal of this exercice is to learn to load an embedding on SpaCy.
The goal of this exercise is to learn to load an embedding in SpaCy.
1. Install and load the `en_core_web_sm` embedding. Compute the embedding of `car`.
@@ -40,10 +40,10 @@ array([ 1.0522802e+00, 1.4806499e+00, 7.7402556e-01, 1.0373484e+00,
```
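A sketch of the loading step; the download command is run once from the shell:

```
# python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')
embedding = nlp('car')[0].vector  # the word's embedding as a NumPy array
print(embedding)
```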
# Exercice 2: Tokenization
# Exercise 2: Tokenization
The goal of this exercice is to learn to tokenize a document using Spacy. We did this using NLTK yesterday.
The goal of this exercise is to learn to tokenize a document using Spacy. We did this using NLTK yesterday.
1. Tokenize the text below and print the tokens
@@ -68,9 +68,9 @@ The goal of this exercise is to learn to tokenize a document using Spacy. We did
.
```
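A sketch of tokenization in SpaCy; unlike NLTK, the tokenizer runs as part of the language pipeline:

```
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)  # text: the document given in the exercise
for token in doc:
    print(token.text)
```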
## Exercice 3 Embeddings 2
## Exercise 3 Embeddings 2
The goal of this exercice is to learn to use SpaCy embedding on a document.
The goal of this exercise is to learn to use SpaCy embedding on a document.
1. Compute the embedding of all the words in this sentence. The language model considered is `en_core_web_md`
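A sketch of iterating over the words' vectors; the sentence below is a made-up stand-in, since the exercise's sentence is elided in this diff:

```
# python -m spacy download en_core_web_md
import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("I like learning to play the piano")  # hypothetical sentence
for token in doc:
    print(token.text, token.vector.shape)  # one 300-d vector per word
```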
@@ -106,9 +106,9 @@ https://medium.com/datadriveninvestor/cosine-similarity-cosine-distance-6571387f
[logo]: w3day05ex1_plot.png "Plot"
# Exercice 4 Sentences' similarity
# Exercise 4 Sentences' similarity
The goal of this exerice is to learn to compute the similarity between two sentences. As explained in the documentation: **The word embedding of a full sentence is simply the average over all different words**. This is how `similarity` works in SpaCy. This small use case is very interesting because if we build a corpus of sentences that express an intention as **buy shoes**, then we can detect this intention and use it to propose shoes advertisement for customers. The language model used in this exercice is `en_core_web_sm`.
The goal of this exercise is to learn to compute the similarity between two sentences. As explained in the documentation: **The word embedding of a full sentence is simply the average over all different words**. This is how `similarity` works in SpaCy. This small use case is very interesting because if we build a corpus of sentences that express an intention such as **buy shoes**, then we can detect this intention and use it to propose shoe advertisements to customers. The language model used in this exercise is `en_core_web_sm`.
1. Compute the similarities (3 in total) between these sentences:
@@ -135,9 +135,9 @@ The goal of this exercise is to learn to compute the similarity between two sente
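A hedged sketch of the mechanics with `en_core_web_sm`; the three sentences here are made-up stand-ins for the ones given in the exercise:

```
import spacy

nlp = spacy.load('en_core_web_sm')
s1 = nlp("I want to buy shoes")           # hypothetical sentences
s2 = nlp("I am looking for new sneakers")
s3 = nlp("The weather is nice today")

print(s1.similarity(s2))  # doc vectors are averaged word vectors
print(s1.similarity(s3))
print(s2.similarity(s3))
```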
# Exercice 5: NER
# Exercise 5: NER
The goal of this exercice is to learn to use a Named entity recognition algorithm to detect entities.
The goal of this exercise is to learn to use a named entity recognition (NER) algorithm to detect entities.
```
Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services. It is considered one of the Big Five companies in the U.S. information technology industry, along with Amazon, Google, Microsoft, and Facebook.
@@ -189,9 +189,9 @@ https://en.wikipedia.org/wiki/Named-entity_recognition
```
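A sketch of running the pre-trained NER pipeline on the text above:

```
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Apple Inc. is an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, and online services.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # detected entity and its type
```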
# Exercice 6 Part-of-speech tags
# Exercise 6 Part-of-speech tags
The goal od this exercice is to learn to use the Part-of-speech tags (**POS TAG**) using Spacy. As explained in wikipedia, the POS TAG is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
The goal of this exercise is to learn to use Part-of-speech tags (**POS TAG**) using Spacy. As explained in Wikipedia, POS tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
Example
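The worked example is elided in this diff; as a stand-in, a sketch of POS tagging with SpaCy on a made-up sentence:

```
import spacy

nlp = spacy.load('en_core_web_sm')
for token in nlp("Apple is looking at buying a U.K. startup"):
    print(token.text, token.pos_)  # e.g. PROPN, AUX, VERB, ...
```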
20
one_md_per_day_format/piscine/Week3/w3day1.md
@@ -23,9 +23,9 @@ https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-funct
Reproduce this article without backprop
https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9
# Exercice 1 The neuron
# Exercise 1 The neuron
The goal of this exercice is to understand the role of a neuron and to implement a neuron.
The goal of this exercise is to understand the role of a neuron and to implement a neuron.
An artificial neuron, the basic unit of the neural network, (also referred to as a perceptron) is a mathematical function. It takes one or more inputs that are multiplied by values called “weights” and added together. This value is then passed to a non-linear function, known as an activation function, to become the neuron’s output.
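A minimal sketch of such a neuron with a sigmoid activation; the weights and bias are example values:

```
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = np.array(weights)
        self.bias = bias

    def feedforward(self, inputs):
        # weighted sum of the inputs plus the bias, passed to the activation
        total = np.dot(self.weights, inputs) + self.bias
        return sigmoid(total)

neuron = Neuron([0.5, 0.5], 0.0)  # example weights and bias
print(neuron.feedforward(np.array([2, 3])))
```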
@@ -91,7 +91,7 @@ https://victorzhou.com/blog/intro-to-neural-networks/
# Exercise 2 Neural network
The goal of this exercice is to understand how to combine three neurons to form a neural network. A neural newtwork is nothing else than neurons connected together. As shown in the figure the neural network is composed of **layers**:
The goal of this exercise is to understand how to combine three neurons to form a neural network. A neural network is nothing more than neurons connected together. As shown in the figure, the neural network is composed of **layers**:
- Input layer: it only represents input data. **It doesn't contain neurons**.
- Output layer: it represents the last layer. It contains a neuron (in some cases more than 1).
@@ -99,7 +99,7 @@ The goal of this exercise is to understand how to combine three neurons to form
Notice that the neuron **o1** in the output layer takes as input the output of the neurons **h1** and **h2** in the hidden layer.
In exercice 1, you implemented this neuron.
In exercise 1, you implemented this neuron.
![alt text][neuron]
[neuron]: images/day1/ex2/w3_day1_neuron.png "Plot"
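A sketch of wiring three such neurons into the 2-2-1 network of the figure; the weights are whatever the exercise specifies, the class only handles the plumbing:

```
class NeuralNetwork:
    def __init__(self, h1, h2, o1):
        self.h1, self.h2, self.o1 = h1, h2, o1  # Neuron instances

    def feedforward(self, x):
        out_h1 = self.h1.feedforward(x)
        out_h2 = self.h2.feedforward(x)
        # o1 takes the hidden outputs h1 and h2 as its inputs
        return self.o1.feedforward(np.array([out_h1, out_h2]))
```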
@@ -143,9 +143,9 @@ Now, we add two more neurons:
1. This question is validated if the output is: **0.9524917424084265**
# Exercice 3 Log loss
# Exercise 3 Log loss
The goal of this exercice is to implement the Log loss function. As mentioned last week, this function is used in classification as a **loss function**. It means that the better the classifier is, the smaller the loss function is. W2D1, you implemented the gradient descent on the MSE loss to update the weights of the linear regression. Similarly, the minimization of the Log loss leads to finding optimal weights.
The goal of this exercise is to implement the Log loss function. As mentioned last week, this function is used in classification as a **loss function**. It means that the better the classifier is, the smaller the loss function is. In W2D1, you implemented the gradient descent on the MSE loss to update the weights of the linear regression. Similarly, the minimization of the Log loss leads to finding optimal weights.
Log loss: - 1/n * Sum[(y_true*log(y_pred) + (1-y_true)*log(1-y_pred))]
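A direct NumPy transcription of this formula (a sketch; `sklearn.metrics.log_loss`, linked below, can serve as a cross-check):

```
import numpy as np

def log_loss_(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```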
@@ -163,7 +163,7 @@ https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
1. This question is validated if the output is: **0.5472899351247816**.
# Exercice 4 Forward propagation
# Exercise 4 Forward propagation
The goal of this exercise is to compute the log loss on the output of the forward propagation. The data used is the tiny data set below.
@@ -198,9 +198,9 @@ The goal of the network is to predict the success at the exam given math and che
2. This question is validated if the log loss for the 4 students is **0.5485133607757963**.
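A sketch of how the two parts fit together, reusing the `NeuralNetwork` and `log_loss_` sketches above; the grades and labels here are made-up placeholders for the elided data set:

```
X = np.array([[13, 15], [8, 6], [11, 12], [5, 9]])  # hypothetical math/chemistry grades
y_true = np.array([1, 0, 1, 0])                     # hypothetical exam outcomes

# network: a NeuralNetwork instance wired with the exercise's weights
y_pred = np.array([network.feedforward(x) for x in X])
print(log_loss_(y_true, y_pred))
```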
# Exercice 5 Regression
# Exercise 5 Regression
The goal of this exercice is to learn to adapt the output layer to regression.
The goal of this exercise is to learn to adapt the output layer to regression.
As a reminder, one of the reasons the sigmoid is used in classification is that it squashes the output between 0 and 1, which is the expected output range for a probability (W2D2: Logistic regression). However, the output of a regression is not a probability.
In order to perform a regression using a neural network, the activation function of the neuron on the output layer has to be changed to the **identity function**. In mathematics, the identity function is **f(x) = x**. In other words, it returns the input as is. The three steps become:
@@ -218,7 +218,7 @@ In order to perform a regression using a neural network, the activation function
All the other neurons' activation functions **don't change**.
1. Adapt the neuron class implemented in exercice 1. It now takes as a parameter `regression` which is boolean. When its value is `True`, `feedforward` should use the identity function as activation function instead of the sigmoid function.
1. Adapt the neuron class implemented in exercise 1. It now takes a boolean parameter `regression`. When its value is `True`, `feedforward` should use the identity function as the activation function instead of the sigmoid function.
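A sketch of that adaptation; same structure as the neuron in exercise 1, with the activation switched by the flag:

```
class Neuron:
    def __init__(self, weights, bias, regression=False):
        self.weights = np.array(weights)
        self.bias = bias
        self.regression = regression

    def feedforward(self, inputs):
        total = np.dot(self.weights, inputs) + self.bias
        # identity activation f(x) = x for regression, sigmoid otherwise
        return total if self.regression else sigmoid(total)
```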