
feat: clean folders

pull/42/head
Badr Ghazlane 2 years ago
parent
commit
0938f63507
  1. 23
      piscine/week01/day01/ex00/audit/readme.md
  2. 62
      piscine/week01/day01/ex00/readme.md
  3. 19
      piscine/week01/day01/ex01/audit/readme.md
  4. 21
      piscine/week01/day01/ex01/readme.md
  5. 3
      piscine/week01/day01/ex02/audit/readme.md
  6. 6
      piscine/week01/day01/ex02/readme.md
  7. 15
      piscine/week01/day01/ex03/audit/readme.md
  8. 9
      piscine/week01/day01/ex03/readme.md
  9. 40
      piscine/week01/day01/ex04/audit/readme.md
  10. 17
      piscine/week01/day01/ex04/readme.md
  11. 19
      piscine/week01/day01/ex05/audit/readme.md
  12. 17
      piscine/week01/day01/ex05/readme.md
  13. 28
      piscine/week01/day01/ex06/audit/readme.md
  14. 20
      piscine/week01/day01/ex06/readme.md
  15. 32
      piscine/week01/day01/ex07/audit/readme.md
  16. 18
      piscine/week01/day01/ex07/readme.md
  17. 52
      piscine/week01/day01/ex08/audit/readme.md
  18. 1600
      piscine/week01/day01/ex08/data/winequality-red.csv
  19. 72
      piscine/week01/day01/ex08/data/winequality.names
  20. 24
      piscine/week01/day01/ex08/readme.md
  21. 6
      piscine/week01/day01/ex09/audit/readme.md
  22. 10
      piscine/week01/day01/ex09/data/model_forecasts.txt
  23. 26
      piscine/week01/day01/ex09/readme.md
  24. 31
      piscine/week01/day01/readme.md
  25. 9
      piscine/week01/day02/ex00/audit/readme.md
  26. 64
      piscine/week01/day02/ex00/readme.md
  27. 17
      piscine/week01/day02/ex01/audit/readme.md
  28. 17
      piscine/week01/day02/ex01/readme.md
  29. 101
      piscine/week01/day02/ex02/audit/readme.md
  30. 1
      piscine/week01/day02/ex02/data/household_power_consumption.txt
  31. 24
      piscine/week01/day02/ex02/readme.md
  32. 49
      piscine/week01/day02/ex03/audit/readme.md
  33. 20001
      piscine/week01/day02/ex03/data/Ecommerce_purchases.txt
  34. 20
      piscine/week01/day02/ex03/readme.md
  35. 32
      piscine/week01/day02/ex04/audit/readme.md
  36. 151
      piscine/week01/day02/ex04/data/iris.csv
  37. 152
      piscine/week01/day02/ex04/data/iris.data
  38. 26
      piscine/week01/day02/ex04/readme.md
  39. 48
      piscine/week01/day02/readme.md
  40. 9
      piscine/week01/day03/ex00/audit/readme.md
  41. 62
      piscine/week01/day03/ex00/readme.md
  42. 8
      piscine/week01/day03/ex01/audit/readme.md
  43. 28
      piscine/week01/day03/ex01/readme.md
  44. BIN
      piscine/week01/day03/ex01/w1day03_ex1_plot1.png
  45. 8
      piscine/week01/day03/ex02/audit/readme.md
  46. 26
      piscine/week01/day03/ex02/readme.md
  47. BIN
      piscine/week01/day03/ex02/w1day03_ex2_plot1.png
  48. 11
      piscine/week01/day03/ex03/audit/readme.md
  49. 18
      piscine/week01/day03/ex03/readme.md
  50. BIN
      piscine/week01/day03/ex03/w1day03_ex3_plot1.png
  51. 12
      piscine/week01/day03/ex04/audit/readme.md
  52. 25
      piscine/week01/day03/ex04/readme.md
  53. BIN
      piscine/week01/day03/ex04/w1day03_ex4_plot1.png
  54. 11
      piscine/week01/day03/ex05/audit/readme.md
  55. 18
      piscine/week01/day03/ex05/readme.md
  56. BIN
      piscine/week01/day03/ex05/w1day03_ex5_plot1.png
  57. 25
      piscine/week01/day03/ex06/audit/readme.md
  58. 34
      piscine/week01/day03/ex06/readme.md
  59. BIN
      piscine/week01/day03/ex06/w1day03_ex6_plot1.png
  60. 25
      piscine/week01/day03/ex07/audit/readme.md
  61. 24
      piscine/week01/day03/ex07/readme.md
  62. BIN
      piscine/week01/day03/ex07/w1day03_ex7_plot1.png
  63. 47
      piscine/week01/day03/readme.md
  64. 9
      piscine/week01/day04/ex00/audit/readme.md
  65. 55
      piscine/week01/day04/ex00/readme.md
  66. 8
      piscine/week01/day04/ex01/audit/readme.md
  67. 14
      piscine/week01/day04/ex01/readme.md
  68. 23
      piscine/week01/day04/ex02/audit/readme.md
  69. 46
      piscine/week01/day04/ex02/readme.md
  70. 14
      piscine/week01/day04/ex03/audit/readme.md
  71. 34
      piscine/week01/day04/ex03/readme.md
  72. 56
      piscine/week01/day04/ex04/audit/readme.md
  73. 65
      piscine/week01/day04/ex04/readme.md
  74. 8
      piscine/week01/day04/ex05/audit/readme.md
  75. 23
      piscine/week01/day04/ex05/readme.md
  76. 12
      piscine/week01/day04/ex06/audit/readme.md
  77. 32
      piscine/week01/day04/ex06/readme.md
  78. 40
      piscine/week01/day04/readme.md
  79. 9
      piscine/week01/day05/ex00/audit/readme.md
  80. 52
      piscine/week01/day05/ex00/readme.md
  81. 35
      piscine/week01/day05/ex01/audit/readme.md
  82. 7
      piscine/week01/day05/ex01/readme.md
  83. 43
      piscine/week01/day05/ex02/audit/readme.md
  84. 10120
      piscine/week01/day05/ex02/data/AAPL.csv
  85. 16
      piscine/week01/day05/ex02/readme.md
  86. 6
      piscine/week01/day05/ex03/audit/readme.md
  87. 31
      piscine/week01/day05/ex03/readme.md
  88. 62
      piscine/week01/day05/ex04/audit/readme.md
  89. 44
      piscine/week01/day05/ex04/readme.md
  90. 37
      piscine/week01/day05/readme.md
  91. 105
      piscine/week01/raid01/audit/readme.md
  92. 9032
      piscine/week01/raid01/data/fundamentals.csv
  93. 3771
      piscine/week01/raid01/data/sp500.csv
  94. 3645
      piscine/week01/raid01/data/stock_prices.csv
  95. BIN
      piscine/week01/raid01/images/w1_weekend_plot_pnl.png
  96. 155
      piscine/week01/raid01/readme.md
  97. 9
      piscine/week02/day01/ex00/audit/readme.md
  98. 81
      piscine/week02/day01/ex00/readme.md
  99. 22
      piscine/week02/day01/ex01/audit/readme.md
  100. 13
      piscine/week02/day01/ex01/readme.md
Some files were not shown because too many files changed in this diff.

23
piscine/week01/day01/ex00/audit/readme.md

@ -0,0 +1,23 @@
##### The exercise is validated if all questions of the exercise are validated
##### Activate the virtual environment. If you used `conda` run `conda activate ex00`
###### Does the shell specify the name `ex00` of the environment on the left?
##### Run `python --version`
###### Does it print `Python 3.8.x`? x could be any number from 0 to 9.
##### Does `import jupyter` and `import numpy` run without any error?
###### Have you used the following command: `jupyter notebook --port 8891`?
###### Is there a file named `Notebook_ex00.ipynb` in the working directory?
###### Is the following markdown code executed in a markdown cell as the first cell?
```
# H1 TITLE
## H2 TITLE
```
###### Does the second cell contain `print("Buy the dip ?")` and return `Buy the dip ?` in the output section?

62
piscine/week01/day01/ex00/readme.md

@ -0,0 +1,62 @@
# W1D01 Piscine AI - Data Science
## NumPy
The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Your first NumPy array
- Exercise 2 Zeros
- Exercise 3 Slicing
- Exercise 4 Random
- Exercise 5 Split, concatenate, reshape arrays
- Exercise 6 Broadcasting and Slicing
- Exercise 7 NaN
- Exercise 8 Wine
- Exercise 9 Football tournament
## Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
*Version of NumPy I used to do the exercises: 1.18.1*.
I suggest using the most recent one.
## Resources
- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow you to write and test code within seconds. However, it is really easy to write unstable, non-reproducible code in notebooks. Keep the notebook and the underlying code clean. The article below details when a notebook should be used. Notebooks can be used for most of the exercises of the piscine as the goal is to experiment A LOT. But no worries, you'll be asked to build a more robust structure for all the projects.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **last stable versions** of Python. However, for educational purposes you will install a specific version of Python in this exercise.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment named `ex00`, with Python `3.8`, with the following libraries: `numpy`, `jupyter`.
2. Launch a `jupyter notebook` on port `8891` and create a notebook named `Notebook_ex00`. `JupyterLab` can be used instead of Jupyter Notebook here.
3. Put the text `H1 TITLE` as **heading level 1** and `H2 TITLE` as **heading level 2** in the first cell.
4. Run `print("Buy the dip ?")` in the second cell
## Resources:
- https://www.python.org/
- https://docs.conda.io/
- https://jupyter.org/
- https://numpy.org/
- https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330
- https://odsc.medium.com/why-you-should-be-using-jupyter-notebooks-ea2e568c59f2

19
piscine/week01/day01/ex01/audit/readme.md

@ -0,0 +1,19 @@
##### This exercise is validated if `your_numpy_array` is a NumPy array. It can be checked with `type(your_numpy_array)`, which should be equal to `numpy.ndarray`, and if the types of its elements are as follows.
##### Try and run the following code.
```python
for i in your_np_array:
    print(type(i))
<class 'int'>
<class 'float'>
<class 'str'>
<class 'dict'>
<class 'list'>
<class 'tuple'>
<class 'set'>
<class 'bool'>
```
###### Does it display the right types as above?

21
piscine/week01/day01/ex01/readme.md

@ -0,0 +1,21 @@
# Exercise 1 Your first NumPy array
The goal of this exercise is to use many Python data types in **NumPy** arrays. **NumPy** arrays are used intensively in **NumPy** and **Pandas**. They are flexible and allow the use of optimized underlying **NumPy** functions.
1. Create a NumPy array that contains: an integer, a float, a string, a dictionary, a list, a tuple, a set and a boolean.
The expected output is:
```python
for i in your_np_array:
    print(type(i))
<class 'int'>
<class 'float'>
<class 'str'>
<class 'dict'>
<class 'list'>
<class 'tuple'>
<class 'set'>
<class 'bool'>
```
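A minimal sketch of one possible construction; the `dtype=object` argument is the key assumption, so NumPy keeps the heterogeneous Python objects as-is:

```python
import numpy as np

# dtype=object stores each element as a plain Python object
your_np_array = np.array(
    [1, 1.5, 'piscine', {'a': 1}, [1, 2], (1, 2), {1, 2}, True],
    dtype=object,
)
for i in your_np_array:
    print(type(i))
```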

3
piscine/week01/day01/ex02/audit/readme.md

@ -0,0 +1,3 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the solution uses `np.zeros` and if the shape of the array is `(300,)`
##### The question 2 is validated if the solution uses `reshape` and the shape of the array is `(3, 100)`

6
piscine/week01/day01/ex02/readme.md

@ -0,0 +1,6 @@
# Exercise 2 Zeros
The goal of this exercise is to learn to create a NumPy array with 0s.
1. Create a NumPy array of dimension **300** with zeros without filling it manually
2. Reshape it to **(3,100)**
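A minimal sketch, assuming `np.zeros` and `reshape` as suggested by the audit:

```python
import numpy as np

zeros = np.zeros(300)             # Q1: 300 zeros, shape (300,)
reshaped = zeros.reshape(3, 100)  # Q2: shape (3, 100)
print(zeros.shape, reshaped.shape)
```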

15
piscine/week01/day01/ex03/audit/readme.md

@ -0,0 +1,15 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100 and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`.
##### The question 2 is validated if the solution is: `integers[::2]`
##### The question 3 is validated if the solution is: `integers[::-2]`
##### The question 4 is validated if the array is: `np.array([1,0,3,4,0,...,0,99,100])`. There are at least two ways to get this result without a for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array:
```python
mask = (integers+1)%3 == 0
integers[mask] = 0
```

9
piscine/week01/day01/ex03/readme.md

@ -0,0 +1,9 @@
# Exercise 3 Slicing
The goal of this exercise is to learn NumPy indexing/slicing. It allows accessing values of a NumPy array efficiently and without a for loop.
1. Create a NumPy array of dimension 1 that contains all integers from 1 to 100 ordered.
2. Without using a for loop and using the array created in Q1, create an array that contains all odd integers. The expected output is: `np.array([1,3,...,99])`. *Hint*: it takes one line
3. Without using a for loop and using the array created in Q1, create an array that contains all even integers reversed. The expected output is: `np.array([100,98,...,2])`. *Hint*: it takes one line
4. Using the array of Q1, set the value of every third element (starting with the second) to 0. The expected output is: `np.array([1,0,3,4,0,...,0,99,100])`
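A minimal sketch of the four questions; the `.copy()` calls are there because basic slices are views, so the in-place edit of Q4 would otherwise modify them:

```python
import numpy as np

integers = np.arange(1, 101)            # Q1: 1..100 without a for loop
odds = integers[::2].copy()             # Q2: array([1, 3, ..., 99])
evens_reversed = integers[::-2].copy()  # Q3: array([100, 98, ..., 2])
integers[1::3] = 0                      # Q4: zero every third element, from the second
```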

40
piscine/week01/day01/ex04/audit/readme.md

@ -0,0 +1,40 @@
##### The exercise is validated if all questions of the exercise are validated
##### For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
##### The question 1 is validated if the solution is: `np.random.seed(888)`
##### The question 2 is validated if the solution is: `np.random.randn(100)`. The value of the first element is `0.17620087373662233`.
##### The question 3 is validated if the solution is: `np.random.randint(1,11,(8,8))`.
Given the NumPy version and the seed, you should have this output:
```console
array([[ 7, 4, 8, 10, 2, 1, 1, 10],
       [ 4, 1, 7, 4, 3, 5, 2, 8],
       [ 3, 9, 7, 4, 9, 6, 10, 5],
       [ 7, 10, 3, 10, 2, 1, 3, 7],
       [ 3, 2, 3, 2, 10, 9, 5, 4],
       [ 4, 1, 9, 7, 1, 4, 3, 5],
       [ 3, 2, 10, 8, 6, 3, 9, 4],
       [ 4, 4, 9, 2, 8, 5, 9, 5]])
```
##### The question 4 is validated if the solution is: `np.random.randint(1,18,(4,2,5))`.
Given the NumPy version and the seed, you should have this output:
```console
array([[[14, 16, 8, 15, 14],
        [17, 13, 1, 4, 17]],
       [[ 7, 15, 2, 8, 3],
        [ 9, 4, 13, 9, 15]],
       [[ 5, 11, 11, 14, 10],
        [ 2, 1, 15, 3, 3]],
       [[ 3, 10, 5, 16, 13],
        [17, 12, 9, 7, 16]]])
```

17
piscine/week01/day01/ex04/readme.md

@ -0,0 +1,17 @@
# Exercise 4 Random
The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons:
lack of real data, creating a random benchmark, using varied data sets.
NumPy proposes a lot of options to generate random data. In statistics, assumptions are made on the distribution the data comes from. All the data distributions that can be generated randomly are described in the documentation. In this exercise we will focus on two distributions:
- Uniform: For example, if your goal is to generate a random number from 1 to 100 where every number has the same probability, you'll need the uniform distribution. NumPy provides `randint` and `uniform` to generate uniformly distributed data.
- Normal: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls** it can be done using the normal distribution. In that case, we need two parameters: the mean (1m51) and the standard deviation (0.0741m). NumPy provides `randn` to generate normally distributed data (among others).
https://numpy.org/doc/stable/reference/random/generator.html
1. Set the seed to 888
2. Generate a **one-dimensional** array of size 100 with a normal distribution
3. Generate a **two-dimensional** array of size 8,8 with random integers from 1 to 10 - both included (same probability for each integer)
4. Generate a **three-dimensional** array of size 4,2,5 with random integers from 1 to 17 - both included (same probability for each integer)
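A minimal sketch matching the audit's solutions:

```python
import numpy as np

np.random.seed(888)                         # Q1
normal = np.random.randn(100)               # Q2: 1-D standard normal sample
grid = np.random.randint(1, 11, (8, 8))     # Q3: integers 1..10, both included
cube = np.random.randint(1, 18, (4, 2, 5))  # Q4: integers 1..17, both included
```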

19
piscine/week01/day01/ex05/audit/readme.md

@ -0,0 +1,19 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 50 is part of the array.
##### The question 2 is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 100 is part of the array.
##### The question 3 is validated if you concatenated the arrays with `np.concatenate((array1, array2))` (note that the arrays are passed as a tuple).
##### The question 4 is validated if the result is:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
```
The easiest way is to use `array.reshape(10,10)`.
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of NumPy Arrays)

17
piscine/week01/day01/ex05/readme.md

@ -0,0 +1,17 @@
# Exercise 5: Split, concatenate, reshape arrays
The goal of this exercise is to learn to concatenate and reshape arrays.
1. Generate an array with integers from 1 to 50: `array([1,...,50])`
2. Generate an array with integers from 51 to 100: `array([51,...,100])`
3. Using `np.concatenate`, concatenate the two arrays into: `array([1,...,100])`
4. Reshape the previous array into:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
```
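A minimal sketch (note that `np.concatenate` takes a tuple of arrays):

```python
import numpy as np

first = np.arange(1, 51)                    # Q1: array([1, ..., 50])
second = np.arange(51, 101)                 # Q2: array([51, ..., 100])
combined = np.concatenate((first, second))  # Q3: array([1, ..., 100])
square = combined.reshape(10, 10)           # Q4: rows of 10 consecutive integers
```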

28
piscine/week01/day01/ex06/audit/readme.md

@ -0,0 +1,28 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is the same as:
`np.ones([9,9], dtype=np.int8)`
##### The question 2 is validated if the output is
```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
##### The solution of question 2 is not accepted if the values of the array have been changed one by one manually. The usage of a for loop is not allowed either.
Here is an example of a possible solution:
```python
x[1:8,1:8] = 0
x[2:7,2:7] = 1
x[3:6,3:6] = 0
x[4,4] = 1
```

20
piscine/week01/day01/ex06/readme.md

@ -0,0 +1,20 @@
# Exercise 6: Broadcasting and Slicing
The goal of this exercise is to learn to access values of n-dimensional arrays efficiently.
1. Create a 2-dimensional array of size 9x9 filled with 1s. Each value has to be an `int8`.
2. Using **slicing**, output this array:
```python
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: Computation on Arrays: Broadcasting)
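A minimal sketch of the slicing approach given in the audit:

```python
import numpy as np

x = np.ones((9, 9), dtype=np.int8)  # Q1: 9x9 array of 1s
x[1:8, 1:8] = 0                     # Q2: carve nested frames with slices
x[2:7, 2:7] = 1
x[3:6, 3:6] = 0
x[4, 4] = 1
print(x)
```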

32
piscine/week01/day01/ex07/audit/readme.md

@ -0,0 +1,32 @@
##### The exercise is validated if all questions of the exercise are validated
##### This question is validated if, without having used a for loop or having filled the array manually, the output is:
```console
[[ 7. 1. 7.]
 [nan 2. 2.]
 [nan 8. 8.]
 [ 9. 3. 9.]
 [ 8. 9. 8.]
 [nan 2. 2.]
 [ 8. 2. 8.]
 [nan 6. 6.]
 [ 9. 2. 9.]
 [ 8. 5. 8.]]
```
There are two steps in this exercise:
- Create the vector that contains the grade of the first exam if available or the second. This can be done using `np.where`:
```python
np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
```
- Add this vector as the third column of the array. Here are two ways:
```python
np.insert(arr = grades, values = new_vector, axis = 1, obj = 2)
np.hstack((grades, new_vector[:, None]))
```

18
piscine/week01/day01/ex07/readme.md

@ -0,0 +1,18 @@
# Exercise 7: NaN
The goal of this exercise is to learn to deal with missing data in NumPy and to manipulate NumPy arrays.
Let us consider a 2-dimensional array that contains the grades at the past two exams. Some of the students missed the first exam. As the grade is missing it has been replaced with a `NaN`.
1. Using `np.where` create a third column that is equal to the grade of the first exam if it exists, otherwise to the grade of the second exam. Add it as the third column of the array.
**Using a for loop or if/else statement is not allowed in this exercise.**
```python
import numpy as np
generator = np.random.default_rng(123)
grades = np.round(generator.uniform(low = 0.0, high = 10.0, size = (10, 2)))
grades[[1,2,5,7], [0,0,0,0]] = np.nan
print(grades)
```

52
piscine/week01/day01/ex08/audit/readme.md

@ -0,0 +1,52 @@
1. This question is validated if the text file has successfully been loaded into a NumPy array with
`genfromtxt('winequality-red.csv', delimiter=',')` and the reduced array weighs **76800 bytes**
2. This question is validated if the output is
```python
array([[ 7.4 , 0.7 , 0. , 1.9 , 0.076 , 11. , 34. ,
         0.9978, 3.51 , 0.56 , 9.4 , 5. ],
       [ 7.4 , 0.66 , 0. , 1.8 , 0.075 , 13. , 40. ,
         0.9978, 3.51 , 0.56 , 9.4 , 5. ],
       [ 6.7 , 0.58 , 0.08 , 1.8 , 0.097 , 15. , 65. ,
         0.9959, 3.28 , 0.54 , 9.2 , 5. ]])
```
This slicing gives the answer `my_data[[1,6,11],:]`.
3. This question is validated if the answer is `False`. There are many ways to get the answer: find the maximum or check values greater than 20.
4. This question is validated if the answer is 10.422983114446529.
5. This question is validated if the answer is:
```console
pH stats
25 percentile: 3.21
50 percentile: 3.31
75 percentile: 3.4
mean: 3.3111131957473416
min: 2.74
max: 4.01
```
> *Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`.*
6. This question is validated if the answer is ~`5.2`. The first step is to get the 20th percentile of the column `sulphates`, then create a boolean array that contains `True` if the value is smaller than the 20th percentile, then select these rows with the column `quality` and compute the `mean`.
7. This question is validated if the output for the best wines is:
```python
array([ 8.56666667, 0.42333333, 0.39111111, 2.57777778, 0.06844444,
       13.27777778, 33.44444444, 0.99521222, 3.26722222, 0.76777778,
       12.09444444, 8. ])
```
And the output for the bad wines is:
```python
array([ 8.36 , 0.8845 , 0.171 , 2.635 , 0.1225 , 11. ,
       24.9 , 0.997464, 3.398 , 0.57 , 9.955 , 3. ])
```
This can be done in three steps: get the max, create a boolean mask that indicates the rows with max quality, then use this mask to subset the rows with the best quality and compute the mean on axis 0.

1600
piscine/week01/day01/ex08/data/winequality-red.csv

File diff suppressed because it is too large.

72
piscine/week01/day01/ex08/data/winequality.names

@ -0,0 +1,72 @@
Citation Request:
This dataset is public available for research. The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
1. Title: Wine Quality
2. Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
3. Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. PH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
these datasets under a regression approach. The support vector machine model achieved the
best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
analysis procedure).
4. Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are munch more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.
5. Number of Instances: red wine - 1599; white wine - 4898.
6. Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
feature selection.
7. Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
8. Missing Attribute Values: None

24
piscine/week01/day01/ex08/readme.md

@ -0,0 +1,24 @@
# Exercise 8: Wine
The goal of this exercise is to learn to perform a basic data analysis on real data using NumPy.
The data set that will be used for this exercise is the red wine data set.
https://archive.ics.uci.edu/ml/datasets/wine+quality
How to tell if a given 2D array has null columns?
1. Using `genfromtxt` load the data and reduce the size of the NumPy array by optimizing the types. The sum of absolute differences between the original data set and the "memory" optimized one has to be smaller than `1e-3`. I suggest using `np.float32`. Check that the NumPy array weighs **76800 bytes**.
2. Print the 2nd, 7th and 12th rows as a two-dimensional array
3. Is there any wine with a percentage of alcohol greater than 20%? Return `True` or `False`
4. What is the average percentage of alcohol of all wines in the data set? If needed, drop `np.nan` values
5. Compute the minimum, the maximum, the 25th percentile, the 50th percentile (median), the 75th percentile and the mean of the pH
6. Compute the average quality of the wines having the 20% least sulphates
7. Compute the mean of all variables for wines having the best quality. Same question for the wines having the worst quality
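A hedged sketch of the first questions; the relative file path is an assumption, and alcohol is the 11th column (index 10):

```python
import numpy as np

# genfromtxt turns the header row into NaNs, which is why the float32 array
# weighs 1600 * 12 * 4 = 76800 bytes
data = np.genfromtxt('data/winequality-red.csv', delimiter=',').astype(np.float32)
print(data.nbytes)                     # Q1: 76800

print(data[[1, 6, 11], :])             # Q2: 2nd, 7th and 12th rows
print(bool((data[:, 10] > 20).any()))  # Q3: False
print(np.nanmean(data[:, 10]))         # Q4: average % of alcohol, NaNs dropped
```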

6
piscine/week01/day01/ex09/audit/readme.md

@ -0,0 +1,6 @@
##### This exercise is validated if the output is:
```console
[[0 3 1 2 4]
[7 6 8 9 5]]
```

10
piscine/week01/day01/ex09/data/model_forecasts.txt

@ -0,0 +1,10 @@
nan -9.480000000000000426e+00 1.415000000000000036e+01 1.126999999999999957e+01 -5.650000000000000355e+00 3.330000000000000071e+00 1.094999999999999929e+01 -2.149999999999999911e+00 5.339999999999999858e+00 -2.830000000000000071e+00
9.480000000000000426e+00 nan 4.860000000000000320e+00 -8.609999999999999432e+00 7.820000000000000284e+00 -1.128999999999999915e+01 1.324000000000000021e+01 4.919999999999999929e+00 2.859999999999999876e+00 9.039999999999999147e+00
-1.415000000000000036e+01 -1.126999999999999957e+01 nan 1.227999999999999936e+01 -2.410000000000000142e+00 6.040000000000000036e+00 -5.160000000000000142e+00 -3.870000000000000107e+00 -1.281000000000000050e+01 1.790000000000000036e+00
5.650000000000000355e+00 -3.330000000000000071e+00 -1.094999999999999929e+01 nan -1.364000000000000057e+01 0.000000000000000000e+00 2.240000000000000213e+00 -3.609999999999999876e+00 -7.730000000000000426e+00 8.000000000000000167e-02
2.149999999999999911e+00 -5.339999999999999858e+00 2.830000000000000071e+00 -4.860000000000000320e+00 nan -8.800000000000000044e-01 -8.570000000000000284e+00 2.560000000000000053e+00 -7.030000000000000249e+00 -6.330000000000000071e+00
8.609999999999999432e+00 -7.820000000000000284e+00 1.128999999999999915e+01 -1.324000000000000021e+01 -4.919999999999999929e+00 nan -1.296000000000000085e+01 -1.282000000000000028e+01 -1.403999999999999915e+01 1.456000000000000050e+01
-2.859999999999999876e+00 -9.039999999999999147e+00 -1.227999999999999936e+01 2.410000000000000142e+00 -6.040000000000000036e+00 5.160000000000000142e+00 nan -1.091000000000000014e+01 -1.443999999999999950e+01 -1.372000000000000064e+01
3.870000000000000107e+00 1.281000000000000050e+01 -1.790000000000000036e+00 1.364000000000000057e+01 -0.000000000000000000e+00 -2.240000000000000213e+00 3.609999999999999876e+00 nan 1.053999999999999915e+01 -1.417999999999999972e+01
7.730000000000000426e+00 -8.000000000000000167e-02 8.800000000000000044e-01 8.570000000000000284e+00 -2.560000000000000053e+00 7.030000000000000249e+00 6.330000000000000071e+00 1.296000000000000085e+01 nan -1.169999999999999929e+01
1.282000000000000028e+01 1.403999999999999915e+01 -1.456000000000000050e+01 1.091000000000000014e+01 1.443999999999999950e+01 1.372000000000000064e+01 -1.053999999999999915e+01 1.417999999999999972e+01 1.169999999999999929e+01 nan

26
piscine/week01/day01/ex09/readme.md

@ -0,0 +1,26 @@
# Exercise 9: Football tournament
The goal of this exercise is to learn to use permutations.
A football tournament is organized in your city. There are 10 teams and the director of the tournament wants you to create a first round as exciting as possible. To do so, you are allowed to choose the pairs. As a former data scientist, you implemented a model based on the teams' current season performance. This model predicts the score difference between two teams. You used this algorithm to predict the score difference for every possible pair.
The matrix returned is a 2-dimensional array that contains in (i,j) the score difference between team i and team j. The matrix is in `model_forecasts.txt`.
Using this output, what are the pairs that will give the most interesting matches?
If a team wins 7-1 the match is obviously less exciting than a match where the winner wins 2-1.
The criterion that corresponds to **the pairs that will give the most interesting matches** is **the pairs that minimize the sum of squared differences**.
The expected output is:
```console
[[m1_t1 m2_t1 m3_t1 m4_t1 m5_t1]
[m1_t2 m2_t2 m3_t2 m4_t2 m5_t2]]
```
- m1_t1 stands for match1_team1
- m1_t1 plays against m1_t2 ...
**Usage of a for loop is not allowed; you may need to use the library** `itertools` **to create permutations**
https://docs.python.org/3.9/library/itertools.html
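The no-for-loop constraint aside, here is a brute-force reference sketch that clarifies the objective (enumerate every way to split the 10 teams into 5 pairs, then keep the pairing with the smallest sum of squared predicted differences):

```python
import numpy as np

scores = np.genfromtxt('model_forecasts.txt')  # 10x10, nan on the diagonal

def pairings(teams):
    """Yield every way to split the team list into unordered pairs."""
    if not teams:
        yield []
        return
    first, rest = teams[0], teams[1:]
    for i, partner in enumerate(rest):
        for tail in pairings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + tail

best = min(pairings(list(range(10))),
           key=lambda ps: sum(scores[i, j] ** 2 for i, j in ps))
print(np.array(best).T)  # row 0: m1_t1..m5_t1, row 1: m1_t2..m5_t2
```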

31
piscine/week01/day01/readme.md

@ -0,0 +1,31 @@
# W1D01 Piscine AI - Data Science
## NumPy
The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Your first NumPy array
- Exercise 2 Zeros
- Exercise 3 Slicing
- Exercise 4 Random
- Exercise 5 Split, concatenate, reshape arrays
- Exercise 6 Broadcasting and Slicing
- Exercise 7 NaN
- Exercise 8 Wine
- Exercise 9 Football tournament
## Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
*Version of NumPy I used to do the exercises: 1.18.1*.
I suggest using the most recent one.
## Resources
- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/

9
piscine/week01/day02/ex00/audit/readme.md

@ -0,0 +1,9 @@
##### The exercise is validated if all questions of the exercise are validated
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy` and `import pandas` run without any error?

64
piscine/week01/day02/ex00/readme.md

@ -0,0 +1,64 @@
# W1D02 Piscine AI - Data Science
## Pandas
The goal of this day is to understand practical usage of **Pandas**.
As **Pandas** is intensively used in Data Science, other days of the piscine will be dedicated to it.
Not only is the **Pandas** library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection.
**Pandas** is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in **Pandas**. Data in **Pandas** is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even if it is 40 pages long.
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Your first DataFrame
- Exercise 2 Electric power consumption
- Exercise 3 E-commerce purchases
- Exercise 4 Handling missing values
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- If I had to give you one resource it would be this one:
https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf
It contains ALL you need to know about Pandas.
- Pandas documentation:
- https://pandas.pydata.org/docs/
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **last stable versions** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`.

17
piscine/week01/day02/ex01/audit/readme.md

@ -0,0 +1,17 @@
##### The exercise is validated if all questions of the exercise are validated
##### The solution of question 1 is accepted if the DataFrame created is the same as the "model" DataFrame. Check that the index is not 1,2,3,4,5.
##### The solution of question 2 is accepted if the types you get for the columns are as below and if the types of the first value of the columns are as below
```console
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
```
```console
<class 'str'>
<class 'list'>
<class 'float'>
```

17
piscine/week01/day02/ex01/readme.md

@ -0,0 +1,17 @@
# Exercise 1: Your first DataFrame
The goal of this exercise is to learn to create basic Pandas objects.
1. Create the DataFrame below in two different ways:
- From a NumPy array
- From a Pandas Series
| | color | list | number |
|---:|:--------|:--------|---------:|
| 1 | Blue | [1, 2] | 1.1 |
| 3 | Red | [3, 4] | 2.2 |
| 5 | Pink | [5, 6] | 3.3 |
| 7 | Grey | [7, 8] | 4.4 |
| 9 | Black | [9, 10] | 5.5 |
2. Print the type of every column and the type of the first value of every column
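A minimal sketch of one of the two ways (from Pandas Series), assuming the odd index shown in the table:

```python
import pandas as pd

idx = [1, 3, 5, 7, 9]
df = pd.DataFrame({
    'color': pd.Series(['Blue', 'Red', 'Pink', 'Grey', 'Black'], index=idx),
    'list': pd.Series([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], index=idx),
    'number': pd.Series([1.1, 2.2, 3.3, 4.4, 5.5], index=idx),
})

for col in df.columns:
    # Q2: type of the column (a Series) and type of its first value
    print(type(df[col]), type(df[col].iloc[0]))
```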

101
piscine/week01/day02/ex02/audit/readme.md

@ -0,0 +1,101 @@
##### The exercise is validated if all questions of the exercise are validated
##### The solution of question 1 is accepted if you use `drop` with `axis=1`. `inplace=True` may be useful to avoid assigning the result to a variable. A solution that could also be accepted (even if it's not a solution I recommend) is `del`.
##### The solution of question 2 is accepted if the DataFrame returns the output below. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted. I recommend using `set_index` with `inplace=True` to do so.
```python
Input: df.head().index
Output:
DatetimeIndex(['2006-12-16', '2006-12-16','2006-12-16', '2006-12-16','2006-12-16'],
dtype='datetime64[ns]', name='Date', freq=None)
```
##### The solution of question 3 is accepted if all the types are `float64` as below. The preferred solution is `pd.to_numeric` with `errors='coerce'`.
```python
Input: df.dtypes
Output:
Global_active_power float64
Global_reactive_power float64
Voltage float64
Global_intensity float64
Sub_metering_1 float64
dtype: object
```
##### The solution of question 4 is accepted if you use `df.describe()`.
##### The solution of question 5 is accepted if you used `dropna` and have the number of missing values equal to 0. You should have noticed that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` allows checking the number of missing values and `df.dropna()` with `inplace=True` allows removing the rows with missing values.
##### The solution of question 6 is accepted if one of the two approaches below were used:
```python
#solution 1
df.loc[:,'A'] = (df['A'] + 1) * 0.06
#solution 2
df.loc[:,'A'] = df.loc[:,'A'].apply(lambda x: (x+1)*0.06)
```
You may wonder why `df.loc[:,'A']` is required and if `df['A'] = ...` works too. **The answer is no**. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you would be assigning a value to a **copy** of the DataFrame and not to the DataFrame itself.
More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
##### The solution of question 7 is accepted as long as the output of `print(filtered_df.head().to_markdown())` is as below and if the number of rows is equal to **449667**.
| Date | Global_active_power | Global_reactive_power |
|:--------------------|----------------------:|------------------------:|
| 2008-12-27 00:00:00 | 0.996 | 0.066 |
| 2008-12-27 00:00:00 | 1.076 | 0.162 |
| 2008-12-27 00:00:00 | 1.064 | 0.172 |
| 2008-12-27 00:00:00 | 1.07 | 0.174 |
| 2008-12-27 00:00:00 | 0.804 | 0.184 |
##### The solution of question 8 is accepted if the output is
```console
Global_active_power 0.254
Global_reactive_power 0.000
Voltage 238.350
Global_intensity 1.200
Sub_metering_1 0.000
Name: 2007-02-16 00:00:00, dtype: float64
```
##### The solution of question 9 is accepted if the output is `Timestamp('2009-02-22 00:00:00')`
##### The solution of question 10 is accepted if the output of `print(sorted_df.tail().to_markdown())` is
| Date | Global_active_power | Global_reactive_power | Voltage |
|:--------------------|----------------------:|------------------------:|----------:|
| 2008-08-28 00:00:00 | 0.076 | 0 | 234.88 |
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.18 |
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.4 |
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.64 |
| 2008-12-08 00:00:00 | 0.076 | 0 | 236.5 |
##### The solution of question 11 is accepted if the output is as below. The solution is based on `groupby` which creates groups based on the index `Date` and aggregates the groups using the `mean`.
```console
Date
2006-12-16 3.053475
2006-12-17 2.354486
2006-12-18 1.530435
2006-12-19 1.157079
2006-12-20 1.545658
...
2010-12-07 0.770538
2010-12-08 0.367846
2010-12-09 1.119508
2010-12-10 1.097008
2010-12-11 1.275571
Name: Global_active_power, Length: 1433, dtype: float64
```

1
piscine/week01/day02/ex02/data/household_power_consumption.txt

@ -0,0 +1 @@
Empty file. The original is too big to be pushed to GitHub.

24
piscine/week01/day02/ex02/readme.md

@ -0,0 +1,24 @@
# Exercise 2 **Electric power consumption**
The goal of this exercise is to learn to manipulate real data with Pandas.
The data set used is **Individual household electric power consumption**
1. Delete the columns `Time`, `Sub_metering_2` and `Sub_metering_3`
2. Set `Date` as index
3. Create a function that takes as input the DataFrame with the data set and returns a DataFrame with updated types:
```python
def update_types(df):
    # TODO
    return df
```
4. Use `describe` to have an overview on the data set
5. Delete the rows with missing values
6. Modify `Sub_metering_1` by adding 1 to it and multiplying the total by 0.06. If x is a row the output is: (x+1)*0.06
7. Select all the rows for which the Date is greater than or equal to 2008-12-27 and `Voltage` is greater than or equal to 242
8. Print the 88888th row.
9. What is the date for which the `Global_active_power` is maximal?
10. Sort the first three columns by descending order of `Global_active_power` and ascending order of `Voltage`.
11. Compute the daily average of `Global_active_power`.
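A hedged sketch of the first steps. The `sep=';'`, the `'?'` missing-value marker and the day-first date format come from the original UCI file and are assumptions here:

```python
import pandas as pd

df = pd.read_csv('data/household_power_consumption.txt', sep=';', na_values='?')

df.drop(columns=['Time', 'Sub_metering_2', 'Sub_metering_3'], inplace=True)  # Q1
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df.set_index('Date', inplace=True)                                           # Q2

def update_types(df):
    # Q3: coerce every remaining column to float64
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df

df = update_types(df)
df.dropna(inplace=True)  # Q5: remove the rows with missing values
```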

49
piscine/week01/day02/ex03/audit/readme.md

@ -0,0 +1,49 @@
##### The exercise is validated if all questions of the exercise are validated
##### To validate this exercise all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas.
##### The solution of question 1 is accepted if it contains **10000 entries** and **14 columns**. There are many solutions based on: `shape`, `info`, `describe`.
##### The solution of question 2 is accepted if the answer is **50.34730200000025**.
Even if `np.mean` gives the solution, `df['Purchase Price'].mean()` is preferred
##### The solution of question 3 is accepted if the min is `0` and the max is `99.989999999999995`
##### The solution of question 4 is accepted if the answer is **1098**
##### The solution of question 5 is accepted if the answer is **30**
##### The solution of question 6 is accepted if there are `4932` people that made the purchase during `AM` and `5068` people that made the purchase during `PM`. There are many ways to get the solution but the goal of this question was to make you use `value_counts`
##### The solution of question 7 is accepted if the answer is as below. There are many ways to get the solution but the goal of this question was to make you use `value_counts`
- Interior and spatial designer 31
- Lawyer 30
- Social researcher 28
- Purchasing manager 27
- Designer, jewellery 27
##### The solution of question 8 is accepted if the purchase price is **75.1**
##### The solution of question 9 is accepted if the email address is **bondellen@williams-garza.com**
##### The solution of question 10 is accepted if the answer is **39**. The preferred solution is based on this: `df[(df['A'] == X) & (df['B'] > Y)]`
##### The solution of question 11 is accepted if the answer is **1033**. The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the expiration date.
##### The solution of question 12 is accepted if the answer is as below. The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
- hotmail.com 1638
- yahoo.com 1616
- gmail.com 1605
- smith.com 42
- williams.com 37

20001
piscine/week01/day02/ex03/data/Ecommerce_purchases.txt

File diff suppressed because it is too large.

20
piscine/week01/day02/ex03/readme.md

@ -0,0 +1,20 @@
# Exercise 3: E-commerce purchases
The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided since exercise 2 should have given you a nice introduction.
The data set used is **E-commerce purchases**.
Questions:
1. How many rows and columns are there?
2. What is the average Purchase Price?
3. What were the highest and lowest purchase prices?
4. How many people have English `'en'` as their Language of choice on the website?
5. How many people have the job title of `"Lawyer"`?
6. How many people made the purchase during `AM` and how many made the purchase during `PM`?
7. What are the 5 most common Job Titles?
8. Someone made a purchase that came from Lot `"90 WT"`, what was the Purchase Price for this transaction?
9. What is the email of the person with the following Credit Card Number: `4926535242672853`?
10. How many people have American Express as their Credit Card Provider and made a purchase above `$95`?
11. How many people have a credit card that expires in `2025`?
12. What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)
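A hedged sketch for a few of the questions; the column names other than the quoted ones are assumptions based on the subject's wording:

```python
import pandas as pd

df = pd.read_csv('data/Ecommerce_purchases.txt')

print(df.shape)                          # Q1: (rows, columns)
print(df['Purchase Price'].mean())       # Q2
print(df['AM or PM'].value_counts())     # Q6 (column name assumed)
print(df['Job'].value_counts().head(5))  # Q7 (column name assumed)
print(df['Email'].apply(lambda s: s.split('@')[1]).value_counts().head(5))  # Q12
```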

32
piscine/week01/day02/ex04/audit/readme.md

@ -0,0 +1,32 @@
##### The exercise is validated if all questions of the exercise are validated (except the bonus question)
##### The solution of question 1 is accepted if you have done these two steps in that order. First, convert the numerical columns to `float` and then fill the missing values. The first step may involve `pd.to_numeric(df.loc[:,col], errors='coerce')`. The second step is validated if you eliminated all missing values. However, there are many possibilities to fill the missing values. Here is one of them:
example:
```python
df.fillna({'sepal_length': df.sepal_length.mean(),
           'sepal_width': df.sepal_width.median(),
           'petal_length': 0,
           'petal_width': 0})
```
##### The solution of question 2 is accepted if the solution is `df.loc[:,col].fillna(df[col].median())`.
##### The solution of the bonus question is accepted if you found the following. Once we filled the missing values as suggested in the first question, `df.describe()` returns this interesting summary. We notice that the mean is way higher than the median. It means that there are probably some outliers in the data. The 75th percentile and the max confirm that: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet you realize this small flower can't be that big. The outliers have a major impact on the mean, which equals 56.9. Filling the missing values with this value is not correct since it doesn't correspond to the real size of this flower. That is why, in that case, the best strategy to fill the missing values is the median. The truth is that I modified the data set! But real data sets ALWAYS contain outliers. Always think about the meaning of the data transformation! If you fill the missing values with zero, it means that you consider that the length or width of some flowers may be 0. It doesn't make sense.
| | sepal_length | sepal_width | petal_length | petal_width |
|:------|---------------:|--------------:|---------------:|--------------:|
| count | 146 | 141 | 120 | 147 |
| mean | 56.9075 | 52.6255 | 15.5292 | 12.0265 |
| std | 572.222 | 417.127 | 127.46 | 131.873 |
| min | -4.4 | -3.6 | -4.8 | -2.5 |
| 25% | 5.1 | 2.8 | 2.725 | 0.3 |
| 50% | 5.75 | 3 | 4.5 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 6900 | 3809 | 1400 | 1600 |
##### The bonus question is also validated if you noticed that there are some negative values and some huge values; if you spotted them, you will be a good data scientist. **YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA**. Print the row with index 122 ;-) This week, we will have the opportunity to focus on data pre-processing to understand how the outliers can be handled.
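A minimal sketch of the coerce-then-fill strategy discussed above; the file path and index column are assumptions:

```python
import pandas as pd

df = pd.read_csv('data/iris.csv', index_col=0)

# Q1, first step: coerce the measurements to float; bad cells (see row 122)
# become NaN instead of silently staying strings
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
for col in cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Q2: the median is robust to the outliers flagged in the bonus question
df[cols] = df[cols].fillna(df[cols].median())
print(df.describe())
```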

151
piscine/week01/day02/ex04/data/iris.csv

@ -0,0 +1,151 @@
,sepal_length,sepal_width,petal_length,petal_width, flower
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,-3.6,-1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,-4.4,2.9,1400.0,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
10,5.4,3.7,,0.2,Iris-setosa
11,4.8,3.4,,0.2,Iris-setosa
12,4.8,3.0,,0.1,Iris-setosa
13,4.3,3.0,,0.1,Iris-setosa
14,5.8,4.0,,0.2,Iris-setosa
15,5.7,4.4,,0.4,Iris-setosa
16,5.4,3.9,,0.4,Iris-setosa
17,5.1,3.5,,0.3,Iris-setosa
18,5.7,3.8,,0.3,Iris-setosa
19,5.1,3.8,,0.3,Iris-setosa
20,5.4,3.4,,0.2,Iris-setosa
21,5.1,3.7,,0.4,Iris-setosa
22,4.6,3.6,,0.2,Iris-setosa
23,5.1,3.3,,0.5,Iris-setosa
24,4.8,3.4,,0.2,Iris-setosa
25,5.0,-3.0,,0.2,Iris-setosa
26,5.0,3.4,,0.4,Iris-setosa
27,5.2,3.5,,0.2,Iris-setosa
28,5.2,3.4,,0.2,Iris-setosa
29,4.7,3.2,,0.2,Iris-setosa
30,4.8,3.1,1.6,0.2,Iris-setosa
31,5.4,3.4,1.5,0.4,Iris-setosa
32,5.2,4.1,1.5,0.1,Iris-setosa
33,5.5,4.2,1.4,0.2,Iris-setosa
34,4.9,3.1,1.5,0.1,Iris-setosa
35,5.0,3.2,1.2,0.2,Iris-setosa
36,5.5,3.5,1.3,0.2,Iris-setosa
37,4.9,,1.5,0.1,Iris-setosa
38,4.4,3.0,1.3,0.2,Iris-setosa
39,5.1,3.4,1.5,0.2,Iris-setosa
40,5.0,3.5,1.3,0.3,Iris-setosa
41,4.5,2.3,1.3,0.3,Iris-setosa
42,4.4,3.2,1.3,0.2,Iris-setosa
43,5.0,3.5,1.6,0.6,Iris-setosa
44,5.1,3.8,1.9,0.4,Iris-setosa
45,4.8,3.0,1.4,0.3,Iris-setosa
46,5.1,3809.0,1.6,0.2,Iris-setosa
47,4.6,3.2,1.4,0.2,Iris-setosa
48,5.3,3.7,1.5,0.2,Iris-setosa
49,5.0,3.3,1.4,0.2,Iris-setosa
50,7.0,3.2,4.7,1.4,Iris-versicolor
51,6.4,3200.0,4.5,1.5,Iris-versicolor
52,6.9,3.1,4.9,1.5,Iris-versicolor
53,5.5,2.3,4.0,1.3,Iris-versicolor
54,6.5,2.8,4.6,1.5,Iris-versicolor
55,5.7,2.8,4.5,1.3,Iris-versicolor
56,6.3,3.3,4.7,1600.0,Iris-versicolor
57,4.9,2.4,3.3,1.0,Iris-versicolor
58,6.6,2.9,4.6,1.3,Iris-versicolor
59,5.2,2.7,3.9,,Iris-versicolor
60,5.0,2.0,3.5,1.0,Iris-versicolor
61,5.9,3.0,4.2,1.5,Iris-versicolor
62,6.0,2.2,4.0,1.0,Iris-versicolor
63,6.1,2.9,4.7,1.4,Iris-versicolor
64,5.6,2.9,3.6,1.3,Iris-versicolor
65,6.7,3.1,4.4,1.4,Iris-versicolor
66,5.6,3.0,4.5,1.5,Iris-versicolor
67,5.8,2.7,4.1,1.0,Iris-versicolor
68,6.2,2.2,4.5,1.5,Iris-versicolor
69,5.6,2.5,3.9,1.1,Iris-versicolor
70,5.9,3.2,4.8,1.8,Iris-versicolor
71,6.1,2.8,4.0,1.3,Iris-versicolor
72,6.3,2.5,4.9,1.5,Iris-versicolor
73,6.1,2.8,4.7,1.2,Iris-versicolor
74,6.4,2.9,4.3,1.3,Iris-versicolor
75,6.6,3.0,4.4,1.4,Iris-versicolor
76,6.8,2.8,4.8,1.4,Iris-versicolor
77,6.7,3.0,5.0,1.7,Iris-versicolor
78,6.0,2.9,4.5,1.5,Iris-versicolor
79,5.7,2.6,3.5,1.0,Iris-versicolor
80,5.5,2.4,3.8,1.1,Iris-versicolor
81,5.5,2.4,3.7,1.0,Iris-versicolor
82,5.8,2.7,3.9,1.2,Iris-versicolor
83,6.0,2.7,5.1,1.6,Iris-versicolor
84,5.4,3.0,4.5,1.5,Iris-versicolor
85,6.0,3.4,4.5,1.6,Iris-versicolor
86,6.7,3.1,4.7,1.5,Iris-versicolor
87,6.3,2.3,4.4,1.3,Iris-versicolor
88,5.6,3.0,4.1,1.3,Iris-versicolor
89,5.5,2.5,4.0,1.3,Iris-versicolor
90,5.5,2.6,4.4,1.2,Iris-versicolor
91,6.1,3.0,4.6,1.4,Iris-versicolor
92,5.8,2.6,4.0,1.2,Iris-versicolor
93,5.0,2.3,3.3,1.0,Iris-versicolor
94,5.6,2.7,4.2,1.3,Iris-versicolor
95,5.7,3.0,4.2,1.2,Iris-versicolor
96,5.7,2.9,4.2,1.3,Iris-versicolor
97,6.2,2.9,4.3,1.3,Iris-versicolor
98,5.1,2.5,3.0,1.1,Iris-versicolor
99,5.7,2.8,,1.3,Iris-versicolor
100,,3.3,,2.5,Iris-virginica
101,5.8,2.7,,1.9,Iris-virginica
102,7.1,3.0,,2.1,Iris-virginica
103,6.3,2.9,,1.8,Iris-virginica
104,6.5,3.0,,2.2,Iris-virginica
105,7.6,3.0,6.6,2.1,Iris-virginica
106,4.9,2.5,4.5,1.7,Iris-virginica
107,7.3,2.9,6.3,1.8,Iris-virginica
108,6.7,2.5,5.8,1.8,Iris-virginica
109,7.2,3.6,6.1,2.5,Iris-virginica
110,6.5,3.2,5.1,2.0,Iris-virginica
111,6.4,2.7,5.3,1.9,Iris-virginica
112,6.8,3.0,5.5,2.1,Iris-virginica
113,5.7,2.5,5.0,2.0,Iris-virginica
114,5.8,,5.1,2.4,Iris-virginica
115,6.4,,5.3,2.3,Iris-virginica
116,6.5,,5.5,1.8,Iris-virginica
117,7.7,,6.7,2.2,Iris-virginica
118,7.7,,,2.3,Iris-virginica
119,6.0,,5.0,1.5,Iris-virginica
120,6.9,,5.7,2.3,Iris-virginica
121,5.6,2.8,4.9,2.0,Iris-virginica
122,always,check,the,data,!!!!!!!!
123,6.3,2.7,4.9,1.8,Iris-virginica
124,6.7,3.3,5.7,2.1,Iris-virginica
125,7.2,3.2,6.0,1.8,Iris-virginica
126,6.2,2.8,-4.8,1.8,Iris-virginica
127,,3.0,4.9,1.8,Iris-virginica
128,6.4,2.8,5.6,2.1,Iris-virginica
129,7.2,3.0,5.8,1.6,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
131,7.9,3.8,6.4,2.0,Iris-virginica
132,6.-4,2.8,5.6,2.2,Iris-virginica
133,6.3,2.8,,1.5,Iris-virginica
134,6.1,2.6,5.6,1.4,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica
136,6.3,3.4,5.6,2.4,Iris-virginica
137,6.4,3.1,5.5,1.8,Iris-virginica
138,6.0,3.0,4.8,1.8,Iris-virginica
139,6900,3.1,5.4,2.1,Iris-virginica
140,6.7,3.1,,2.4,Iris-virginica
141,6.9,3.1,5.1,2.3,Iris-virginica
142,580,2.7,5.1,,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,-2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica

152
piscine/week01/day02/ex04/data/iris.data

@ -0,0 +1,152 @@
sepal_length,sepal_width,petal_length,petal_width, flower
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,-3.6,-1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
-4.4,2.9,1400,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1500,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,-1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,-3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,"3.5",1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3809,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3200,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1600,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,-4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.-4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,"5.1",1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6900,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
580,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,-2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica

26
piscine/week01/day02/ex04/readme.md

@ -0,0 +1,26 @@
# Exercise 4 Handling missing values
The goal of this exercise is to learn to handle missing values. In the previous exercise we used a first technique: filtering out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small.
This article explains the different types of missing data and how they should be handled.
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
"**It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.**"
- Preliminary: Drop the `flower` column
1. Fill the missing values with a different "strategy" for each column:
`sepal_length` -> `mean`
`sepal_width` -> `median`
`petal_length`, `petal_width` -> `0`
2. Fill the missing values of each column with its median using `fillna` (a sketch is given after the bonus questions).
- Bonus questions:
 - Filling the missing values with 0 or with the mean of the associated column is common in Data Science. Explain why, in this case, filling the missing values with 0 or the mean is a bad idea.
 - Find a special row ;-)
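A minimal sketch of both questions, using a toy DataFrame standing in for the iris data (the real exercise uses the data file above, cleaned and with `flower` dropped):
```python
import numpy as np
import pandas as pd

# toy frame standing in for the iris data, `flower` already dropped
df = pd.DataFrame({'sepal_length': [5.1, np.nan, 4.7],
                   'sepal_width': [3.5, 3.0, np.nan],
                   'petal_length': [1.4, np.nan, 1.3],
                   'petal_width': [np.nan, 0.2, 0.2]})

# question 1: one "strategy" per column, in a single fillna call
df_q1 = df.fillna({'sepal_length': df['sepal_length'].mean(),
                   'sepal_width': df['sepal_width'].median(),
                   'petal_length': 0,
                   'petal_width': 0})

# question 2: fill every column with its median
df_q2 = df.fillna(df.median())
```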

48
piscine/week01/day02/readme.md

@ -0,0 +1,48 @@
# W1D02 Piscine AI - Data Science
## Pandas
The goal of this day is to understand practical usage of **Pandas**.
As **Pandas** is intensively used in Data Science, other days of the piscine will be dedicated to it.
Not only is the **Pandas** library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection.
**Pandas** is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in **Pandas**. Data in **Pandas** is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even if it is 40 pages long.
## Exercises of the day
- Exercise 1 Your first DataFrame
- Exercise 2 Electric power consumption
- Exercise 3 E-commerce purchases
- Exercise 4 Handling missing values
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- If I had to give you one resource it would be this one:
https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf
It contains ALL you need to know about Pandas.
- Pandas documentation:
- https://pandas.pydata.org/docs/
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html

9
piscine/week01/day03/ex00/audit/readme.md

@ -0,0 +1,9 @@
##### The exercise is validated if all questions of the exercise are validated
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import plotly` run without any error ?

62
piscine/week01/day03/ex00/readme.md

@ -0,0 +1,62 @@
# W1D03 Piscine AI - Data Science
## Visualizations
While working on a dataset it is important to check the distribution of the data. Obviously, for most humans it is difficult to visualize data in more than 3 dimensions.
"Viz" is important to understand the data and to show results. We'll discover three libraries to visualize data in Python. They are among the most used visualization libraries in Python:
- Pandas visualization module
- Matplotlib
- Plotly
The goal is to understand the basics of those libraries. You'll have time during the project to master one (or all three) of them.
You may wonder why using one library is not enough. The reason is simple: it depends on the usage.
For example, if you want to check the data quickly you may want to use the Pandas viz module or Matplotlib.
If you want to plot a custom, more elaborate plot I suggest using Matplotlib or Plotly.
And, if you want to create a very nice and interactive plot I suggest using Plotly.
## Exercises of the day
- Exercise 1 Pandas plot 1
- Exercise 2 Pandas plot 2
- Exercise 3 Matplotlib 1
- Exercise 4 Matplotlib 2
- Exercise 5 Matplotlib subplots
- Exercise 6 Plotly 1
- Exercise 7 Plotly Box plots
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Plotly
- Jupyter or JupyterLab
I suggest using the most recent versions of the packages.
## Resources
- https://matplotlib.org/3.3.3/tutorials/index.html
- https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596
- https://github.com/rougier/matplotlib-tutorial
- https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `plotly`.

8
piscine/week01/day03/ex01/audit/readme.md

@ -0,0 +1,8 @@
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it have a title ?
###### Does it have a name on the x-axis ?
###### Does it have a legend ?
![alt text][logo]
[logo]: ../w1day03_ex1_plot1.png "Bar plot ex1"

28
piscine/week01/day03/ex01/readme.md

@ -0,0 +1,28 @@
# Exercise 1 Pandas plot 1
The goal of this exercise is to learn to create plots with Pandas. Pandas' `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
Here is the data we will be using:
```python
df = pd.DataFrame({
'name':['christopher','marion','maria','mia','clement','randy','remi'],
'age':[70,30,22,19,45,33,20],
'gender':['M','F','F','F','M','M','M'],
'state':['california','dc','california','dc','california','new york','porto'],
'num_children':[2,0,0,3,8,1,4],
'num_pets':[5,1,0,5,2,2,3]
})
```
1. Reproduce this plot. This plot is called a bar plot.
![alt text][logo]
[logo]: ./w1day03_ex1_plot1.png "Bar plot ex1"
The plot has to contain:
- the title
- name on x-axis
- legend
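Only the image defines the exact target, so the columns plotted and the title in the sketch below are assumptions; a minimal starting point could look like:
```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'name': ['christopher', 'marion', 'maria', 'mia', 'clement', 'randy', 'remi'],
    'num_children': [2, 0, 0, 3, 8, 1, 4],
    'num_pets': [5, 1, 0, 5, 2, 2, 3],
})

# bar plot with a title, the names on the x-axis and a legend
ax = df.plot(kind='bar', x='name', title='Children and pets')  # title is an assumption
ax.set_xlabel('name')
ax.legend()
plt.show()
```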

BIN
piscine/week01/day03/ex01/w1day03_ex1_plot1.png

diff.bin_not_shown

After

Width:  |  Height:  |  Size: 9.5 KiB

8
piscine/week01/day03/ex02/audit/readme.md

@ -0,0 +1,8 @@
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria. You should also observe that the older people are, the more children they have.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
![alt text][logo_ex2]
[logo_ex2]: ../w1day03_ex2_plot1.png "Scatter plot ex2"

26
piscine/week01/day03/ex02/readme.md

@ -0,0 +1,26 @@
## Exercise 2: Pandas plot 2
The goal of this exercise is to learn to create plots with Pandas. Pandas' `.plot()` is a wrapper for `matplotlib.pyplot.plot()`.
```python
df = pd.DataFrame({
'name':['christopher','marion','maria','mia','clement','randy','remi'],
'age':[70,30,22,19,45,33,20],
'gender':['M','F','F','F','M','M','M'],
'state':['california','dc','california','dc','california','new york','porto'],
'num_children':[4,2,1,0,3,1,0],
'num_pets':[5,1,0,2,2,2,3]
})
```
1. Reproduce this plot. This plot is called a scatter plot. Do you observe a relationship between the age and the number of children ?
![alt text][logo_ex2]
[logo_ex2]: ./w1day03_ex2_plot1.png "Scatter plot ex2"
The plot has to contain:
- the title
- name on x-axis
- name on y-axis
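A possible sketch, assuming the plot shows age against the number of children (which matches the observation the audit asks for); the title is an assumption:
```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'age': [70, 30, 22, 19, 45, 33, 20],
    'num_children': [4, 2, 1, 0, 3, 1, 0],
})

# scatter plot with a title and names on both axes
df.plot.scatter(x='age', y='num_children', title='Age vs number of children')
plt.show()
```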

BIN
piscine/week01/day03/ex02/w1day03_ex2_plot1.png

diff.bin_not_shown

After

Width:  |  Height:  |  Size: 11 KiB

11
piscine/week01/day03/ex03/audit/readme.md

@ -0,0 +1,11 @@
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
###### Are the x-axis and y-axis limited to [1,8] ?
###### Is the line a red dashdot line with a width of 3 ?
###### Are the circles blue circles with a size of 12 ?
![alt text][logo_ex3]
[logo_ex3]: ../w1day03_ex3_plot1.png "Scatter plot ex3"

18
piscine/week01/day03/ex03/readme.md

@ -0,0 +1,18 @@
## Exercise 3 Matplotlib 1
The goal of this exercise is to learn to use Matplotlib to plot data. As you know, Matplotlib is the underlying library used by Pandas. It provides more options to plot custom visualizations. However, most of the plots we will create with Matplotlib can be reproduced with Pandas' `.plot()`.
1. Reproduce this plot. We assume the data points have integer coordinates.
![alt text][logo_ex3]
[logo_ex3]: ./w1day03_ex3_plot1.png "Scatter plot ex3"
The plot has to contain:
- the title
- name on x-axis and y-axis
- x-axis and y-axis are limited to [1,8]
- **style**:
- red dashdot line with a width of 3
- blue circles with a size of 12
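A minimal sketch of the requested style; the data points below are placeholders since only the image defines them:
```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5, 6, 7]  # placeholder integer coordinates
y = [2, 4, 3, 5, 7, 6, 8]

plt.plot(x, y, linestyle='dashdot', color='red', linewidth=3)  # red dashdot line
plt.plot(x, y, 'o', color='blue', markersize=12)               # blue circles
plt.title('...')   # the real title comes from the image
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(1, 8)
plt.ylim(1, 8)
plt.show()
```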

BIN
piscine/week01/day03/ex03/w1day03_ex3_plot1.png

diff.bin_not_shown

After

Width:  |  Height:  |  Size: 27 KiB

12
piscine/week01/day03/ex04/audit/readme.md

@ -0,0 +1,12 @@
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
###### Is the left data black ?
###### Is the right data red ?
![alt text][logo_ex4]
[logo_ex4]: ../w1day03_ex4_plot1.png "Twin axis ex4"
https://matplotlib.org/gallery/api/two_scales.html

25
piscine/week01/day03/ex04/readme.md

@ -0,0 +1,25 @@
# Exercise 4 Matplotlib 2
The goal of this exercise is to learn to use Matplotlib to plot different lines in the same plot on different axes using `twinx`. This is very useful to compare variables in different ranges.
Here is the data:
```python
left_data = [5, 7, 11, 13, 17]
right_data = [0.1, 0.2, 0.4, 0.8, -1.6]
x_axis = [0.0, 1.0, 2.0, 3.0, 4.0]
```
1. Reproduce this plot
![alt text][logo_ex4]
[logo_ex4]: ./w1day03_ex4_plot1.png "Twin axis plot ex4"
The plot has to contain:
- the title
- name on left y-axis and right y-axis
- **style**:
- left data in black
- right data in red
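A minimal sketch, following the `two_scales` example linked in the audit; the title and axis names are assumptions:
```python
import matplotlib.pyplot as plt

left_data = [5, 7, 11, 13, 17]
right_data = [0.1, 0.2, 0.4, 0.8, -1.6]
x_axis = [0.0, 1.0, 2.0, 3.0, 4.0]

fig, ax1 = plt.subplots()
ax1.plot(x_axis, left_data, color='black')   # left data in black
ax1.set_ylabel('left data', color='black')

ax2 = ax1.twinx()                            # second y-axis sharing the same x-axis
ax2.plot(x_axis, right_data, color='red')    # right data in red
ax2.set_ylabel('right data', color='red')

ax1.set_title('Twin axis plot')              # title is an assumption
plt.show()
```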

BIN
piscine/week01/day03/ex04/w1day03_ex4_plot1.png

diff.bin_not_shown

After

Width:  |  Height:  |  Size: 18 KiB

11
piscine/week01/day03/ex05/audit/readme.md

@ -0,0 +1,11 @@
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it contain 6 subplots (2 rows, 3 columns)?
###### Does it have space between plots (`hspace=0.5` and `wspace=0.5`)?
###### Do all subplots contain a title: `Title i` ?
###### Do all subplots contain a text `(2,3,i)` centered at `(0.5, 0.5)`? *Hint*: check the parameter `ha` of `text`
###### Have all subplots been created in a for loop ?
![alt text][logo_ex5]
[logo_ex5]: ../w1day03_ex5_plot1.png "Subplots ex5"

18
piscine/week01/day03/ex05/readme.md

@ -0,0 +1,18 @@
# Exercise 5 Matplotlib subplots
The goal of this exercise is to learn to use Matplotlib to create subplots.
1. Reproduce this plot using a **for loop**:
![alt text][logo_ex5]
[logo_ex5]: ./w1day03_ex5_plot1.png "Subplots ex5"
The plot has to contain:
- 6 subplots: 2 rows, 3 columns
- Keep space between plots: `hspace=0.5` and `wspace=0.5`
- Each plot contains
- Text (2,3,i) centered at 0.5, 0.5. *Hint*: check the parameter `ha` of `text`
- a title: Title i
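A minimal sketch that satisfies these criteria, built in a for loop:
```python
import matplotlib.pyplot as plt

fig = plt.figure()
fig.subplots_adjust(hspace=0.5, wspace=0.5)       # space between plots

for i in range(1, 7):                             # 6 subplots: 2 rows, 3 columns
    ax = fig.add_subplot(2, 3, i)
    ax.text(0.5, 0.5, f'(2,3,{i})', ha='center')  # centered text
    ax.set_title(f'Title {i}')

plt.show()
```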

BIN
piscine/week01/day03/ex05/w1day03_ex5_plot1.png

diff.bin_not_shown

After

Width:  |  Height:  |  Size: 13 KiB

25
piscine/week01/day03/ex06/audit/readme.md

@ -0,0 +1,25 @@
##### The exercise is validated if all questions of the exercise are validated
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
![alt text][logo_ex6]
[logo_ex6]: ../w1day03_ex6_plot1.png "Time series ex6"
##### The solution of question 2 is accepted if the plot in the image is reproduced using `plotly.graph_objects` and respects those criteria.
###### Does it have a title ?
###### Does it have a name on x-axis and y-axis ?
![alt text][logo_ex6]
[logo_ex6]: ../w1day03_ex6_plot1.png "Time series ex6"

34
piscine/week01/day03/ex06/readme.md

@ -0,0 +1,34 @@
# Exercise 6 Plotly 1
Plotly has evolved a lot over the past years. It is important to **always check the documentation**.
Plotly comes with a high-level interface: Plotly Express. It helps build complex plots easily. The lesson won't detail the complex examples. Plotly Express is quite interesting when using Pandas DataFrames because there are some built-in functions that leverage Pandas DataFrames.
The plot output by Plotly is interactive and can also be dynamic.
The goal of the exercise is to plot the price of a company. Its price is generated below.
```python
returns = np.random.randn(50)
price = 100 + np.cumsum(returns)
dates = pd.date_range(start='2020-09-01', periods=50, freq='B')
df = pd.DataFrame(zip(dates, price),
columns=['Date','Company_A'])
```
1. Using **Plotly express**, reproduce the plot in the image. As the data is generated randomly I do not expect you to reproduce the same line.
![alt text][logo_ex6]
[logo_ex6]: ./w1day03_ex6_plot1.png "Time series ex6"
The plot has to contain:
- title
- x-axis name
- y-axis name
2. Same question but now using `plotly.graph_objects`. You may need to use `init_notebook_mode` from `plotly.offline`.
https://plotly.com/python/time-series/
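A minimal sketch of question 1 with Plotly Express; the title is an assumption:
```python
import numpy as np
import pandas as pd
import plotly.express as px

returns = np.random.randn(50)
price = 100 + np.cumsum(returns)
dates = pd.date_range(start='2020-09-01', periods=50, freq='B')
df = pd.DataFrame(zip(dates, price), columns=['Date', 'Company_A'])

# line plot with a title and names on both axes
fig = px.line(df, x='Date', y='Company_A', title='Company A price')
fig.show()
```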

BIN
piscine/week01/day03/ex06/w1day03_ex6_plot1.png

diff.bin_not_shown

After

Width:  |  Height:  |  Size: 43 KiB

25
piscine/week01/day03/ex07/audit/readme.md

@ -0,0 +1,25 @@
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria. The code below shows a solution.
###### Does it have a title ?
###### Does it have a legend ?
![alt text][logo_ex7]
[logo_ex7]: ../w1day03_ex7_plot1.png "Box plot ex7"
```python
import plotly.graph_objects as go
import numpy as np

y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1  # shift mean
y2 = np.random.randn(50) + 2

fig = go.Figure()
fig.add_trace(go.Box(y=y0, name='Sample A', marker_color='indianred'))
fig.add_trace(go.Box(y=y1, name='Sample B', marker_color='lightseagreen'))
fig.show()
```

24
piscine/week01/day03/ex07/readme.md

@ -0,0 +1,24 @@
# Exercise 7 Plotly Box plots
The goal of this exercise is to learn to use Plotly to plot box plots. A box plot is a method for graphically depicting groups of numerical data through their quartiles and extreme values (min, max). It allows you to quickly compare some variables.
Let us generate 3 random arrays from a normal distribution, shifting the mean of the second and third arrays by 1 and 2 respectively.
```python
y0 = np.random.randn(50)
y1 = np.random.randn(50) + 1 # shift mean
y2 = np.random.randn(50) + 2
```
1. Plot in the same Figure 2 box plots as shown in the image. In this exercise the style is not important.
![alt text][logo_ex7]
[logo_ex7]: ./w1day03_ex7_plot1.png "Box plot ex7"
The plot has to contain:
- the title
- the legend
https://plotly.com/python/box-plots/

BIN
piscine/week01/day03/ex07/w1day03_ex7_plot1.png

diff.bin_not_shown

After

Width:  |  Height:  |  Size: 13 KiB

47
piscine/week01/day03/readme.md

@ -0,0 +1,47 @@
# W1D03 Piscine AI - Data Science
## Visualizations
While working on a dataset it is important to check the distribution of the data. Obviously, for most humans it is difficult to visualize data in more than 3 dimensions.
"Viz" is important to understand the data and to show results. We'll discover three libraries to visualize data in Python. They are among the most used visualization libraries in Python:
- Pandas visualization module
- Matplotlib
- Plotly
The goal is to understand the basics of those libraries. You'll have time during the project to master one (or all three) of them.
You may wonder why using one library is not enough. The reason is simple: it depends on the usage.
For example, if you want to check the data quickly you may want to use the Pandas viz module or Matplotlib.
If you want to plot a custom, more elaborate plot I suggest using Matplotlib or Plotly.
And, if you want to create a very nice and interactive plot I suggest using Plotly.
## Exercises of the day
- Exercise 1 Pandas plot 1
- Exercise 2 Pandas plot 2
- Exercise 3 Matplotlib 1
- Exercise 4 Matplotlib 2
- Exercise 5 Matplotlib subplots
- Exercise 6 Plotly 1
- Exercise 7 Plotly Box plots
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Plotly
- Jupyter or JupyterLab
I suggest using the most recent versions of the packages.
## Resources
- https://matplotlib.org/3.3.3/tutorials/index.html
- https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596
- https://github.com/rougier/matplotlib-tutorial
- https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html

9
piscine/week01/day04/ex00/audit/readme.md

@ -0,0 +1,9 @@
##### The exercise is validated if all questions of the exercise are validated
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
##### Do `import jupyter`, `import numpy` and `import pandas` run without any error ?

55
piscine/week01/day04/ex00/readme.md

@ -0,0 +1,55 @@
# W1D04 Piscine AI - Data Science
## Data wrangling with Pandas
Data wrangling is one of the crucial tasks in data science and analysis which includes operations like:
- Data Sorting: To rearrange values in ascending or descending order.
- Data Filtration: To create a subset of available data.
- Data Reduction: To eliminate or replace unwanted values.
- Data Access: To read or write data files.
- Data Processing: To perform aggregation, statistical, and similar operations on specific values.
As explained before, Pandas is an open source library, specifically developed for data science and analysis. It is built upon the NumPy package (to handle numeric data in tabular form) and has inbuilt data structures to ease up the process of data manipulation, aka data munging/wrangling.
## Exercises of the day
- Exercise 1 Concatenate
- Exercise 2 Merge
- Exercise 3 Merge MultiIndex
- Exercise 4 Groupby Apply
- Exercise 5 Groupby Agg
- Exercise 6 Unstack
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`.

8
piscine/week01/day04/ex01/audit/readme.md

@ -0,0 +1,8 @@
##### This question is validated if the outputted DataFrame is:
| | letter | number |
|---:|:---------|---------:|
| 0 | a | 1 |
| 1 | b | 2 |
| 2 | c | 1 |
| 3 | d | 2 |

14
piscine/week01/day04/ex01/readme.md

@ -0,0 +1,14 @@
# Exercise 1 Concatenate
The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for the Series.
Here are the two DataFrames to concatenate:
```python
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]],
columns=['letter', 'number'])
```
1. Concatenate these two DataFrames along the index axis and reset the index. The index of the output should be `RangeIndex(start=0, stop=4, step=1)`. **Do not change the index manually**.
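A possible solution:
```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]], columns=['letter', 'number'])

# concatenate along the index axis, then rebuild a clean RangeIndex
res = pd.concat([df1, df2]).reset_index(drop=True)
print(res.index)  # RangeIndex(start=0, stop=4, step=1)
```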

23
piscine/week01/day04/ex02/audit/readme.md

@ -0,0 +1,23 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is:
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
|---:|-----:|:-------------|:-------------|:-------------|:-------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
##### The question 2 is validated if the output is:
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
|---:|-----:|:---------------|:---------------|:---------------|:---------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
| 2 | 3 | E | F | nan | nan |
| 3 | 4 | G | H | nan | nan |
| 4 | 5 | I | J | nan | nan |
| 5 | 6 | nan | nan | O | P |
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |
Note: Check that the suffixes are set using the `suffixes` parameter rather than by renaming the columns manually.

46
piscine/week01/day04/ex02/readme.md

@ -0,0 +1,46 @@
# Exercise 2 Merge
The goal of this exercise is to learn to merge DataFrames.
The logic of merging DataFrames in Pandas is quite similar to the one used in SQL.
Here are the two DataFrames to merge:
```python
#df1
df1_dict = {
'id': ['1', '2', '3', '4', '5'],
'Feature1': ['A', 'C', 'E', 'G', 'I'],
'Feature2': ['B', 'D', 'F', 'H', 'J']}
df1 = pd.DataFrame(df1_dict, columns = ['id', 'Feature1', 'Feature2'])
#df2
df2_dict = {
'id': ['1', '2', '6', '7', '8'],
'Feature1': ['K', 'M', 'O', 'Q', 'S'],
'Feature2': ['L', 'N', 'P', 'R', 'T']}
df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
```
1. Merge the two DataFrames to get this output:
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
|---:|-----:|:-------------|:-------------|:-------------|:-------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
2. Merge the two DataFrames to get this output:
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
|---:|-----:|:---------------|:---------------|:---------------|:---------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
| 2 | 3 | E | F | nan | nan |
| 3 | 4 | G | H | nan | nan |
| 4 | 5 | I | J | nan | nan |
| 5 | 6 | nan | nan | O | P |
| 6 | 7 | nan | nan | Q | R |
| 7 | 8 | nan | nan | S | T |
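A possible sketch of both questions, reusing the `df1` and `df2` defined above (the audit confirms the `suffixes` approach for question 2):
```python
# question 1: inner join on id; pandas adds the _x/_y suffixes by default
q1 = df1.merge(df2, on='id', how='inner')

# question 2: outer join keeps all ids; suffixes name the overlapping columns
q2 = df1.merge(df2, on='id', how='outer', suffixes=('_df1', '_df2'))
```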

14
piscine/week01/day04/ex03/audit/readme.md

@ -0,0 +1,14 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a table as below. One of the answers that returns the correct DataFrame is `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`
| | Open | Close | Close_Adjusted | Twitter | Reddit |
|:-----------------------------------------------------|-----------:|----------:|-----------------:|------------:|----------:|
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AAPL') | 0.0991792 | -0.31603 | 0.634787 | -0.00159041 | 1.06053 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'FB') | -0.123753 | 1.00269 | 0.713264 | 0.0142127 | -0.487028 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'GE') | -1.37775 | -1.01504 | 1.2858 | 0.109835 | 0.04273 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 |
##### The question 2 is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`

34
piscine/week01/day04/ex03/readme.md

@ -0,0 +1,34 @@
# Exercise 3 Merge MultiIndex
The goal of this exercise is to learn to merge DataFrames with MultiIndex.
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reason the Data Engineer lost the last 15 days of alternative data.
1. Using `market_data` as the reference, merge `alternative_data` on `market_data`
```python
#generate days
all_dates = pd.date_range('2021-01-01', '2021-12-15')
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
#generate tickers
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']
#create indexes
index_alt = pd.MultiIndex.from_product([all_dates, tickers], names=['Date', 'Ticker'])
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])
# create DFs
market_data = pd.DataFrame(index=index,
data=np.random.randn(len(index), 3),
columns=['Open','Close','Close_Adjusted'])
alternative_data = pd.DataFrame(index=index_alt,
data=np.random.randn(len(index_alt), 2),
columns=['Twitter','Reddit'])
```
`reset_index` is not allowed for this question
2. Fill missing values with 0
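A possible solution for both questions, matching the line the audit accepts:
```python
# question 1: left merge on the MultiIndex, market_data as the reference
merged_df = market_data.merge(alternative_data, how='left',
                              left_index=True, right_index=True)

# question 2: fill the missing alternative data with 0
filled_df = merged_df.fillna(0)
```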
- https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d

56
piscine/week01/day04/ex04/audit/readme.md

@ -0,0 +1,56 @@
##### The exercise is validated if all questions of the exercise are validated and if the for loop hasn't been used. The goal is to use `groupby` and `apply`.
##### The question 1 is validated if the output is:
```python
df = pd.DataFrame(range(1,11), columns=['sequence'])
print(winsorize(df, [0.20, 0.80]).to_markdown())
```
| | sequence |
|---:|-----------:|
| 0 | 2.8 |
| 1 | 2.8 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 8.2 |
| 9 | 8.2 |
##### The question 2 is validated if the output is a Pandas Series or DataFrame with the first 11 rows equal to the output below. The code below gives a solution.
| | sequence |
|---:|-----------:|
| 0 | 1.45 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 9.55 |
| 10 | 11.45 |
```python
import numpy as np

def winsorize(df_series, quantiles):
    """
    df_series: pd.DataFrame or pd.Series
    quantiles: list, e.g. [0.05, 0.95]
    """
    min_value = np.quantile(df_series, quantiles[0])
    max_value = np.quantile(df_series, quantiles[1])
    return df_series.clip(lower=min_value, upper=max_value)

# df is the grouped data set defined in question 2 of the exercise
df.groupby("group")[['sequence']].apply(winsorize, [0.05, 0.95])
```
- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e

65
piscine/week01/day04/ex04/readme.md

@ -0,0 +1,65 @@
# Exercise 4 Groupby Apply
The goal of this exercise is to learn to group the data and apply a function on the groups.
The use case we will work on is computing winsorized values on groups.
1. Create a function that uses `pandas.DataFrame.clip` and that replaces extreme values by a given percentile. The values that are greater than the upper percentile (80%) are replaced by the 80% percentile value. The values that are smaller than the lower percentile (20%) are replaced by the 20% percentile value. This process, which corrects outliers, is called **winsorizing**.
I recommend using NumPy to compute the percentiles to make sure we use the same default parameters.
```python
def winsorize(df, quantiles):
"""
df: pd.DataFrame
quantiles: list
ex: [0.05, 0.95]
"""
#TODO
return
```
Here is what the function should output:
```python
df = pd.DataFrame(range(1,11), columns=['sequence'])
print(winsorize(df, [0.20, 0.80]).to_markdown())
```
| | sequence |
|---:|-----------:|
| 0 | 2.8 |
| 1 | 2.8 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 8.2 |
| 9 | 8.2 |
2. Now we consider that each value belongs to a group. The goal is to apply the **winsorizing to each group**. In this question we use common winsorizing percentiles: `[0.05, 0.95]`. Here is the new data set:
```python
groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4])
df = pd.DataFrame(data= zip(groups,
range(1,51)),
columns=["group", "sequence"])
```
The expected output (first rows) is:
| | sequence |
|---:|-----------:|
| 0 | 1.45 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 9.55 |
| 10 | 11.45 |

8
piscine/week01/day04/ex05/audit/readme.md

@ -0,0 +1,8 @@
##### The question is validated if the output is as below. The columns don't have to be MultiIndex. A solution could be `df.groupby('product').agg({'value':['min','max','mean']})`
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
|:-------------|-------------------:|-------------------:|--------------------:|
| chair | 22.89 | 32.12 | 27.505 |
| mobile phone | 100 | 111.22 | 105.61 |
| table | 20.45 | 99.99 | 51.22 |

23
piscine/week01/day04/ex05/readme.md

@ -0,0 +1,23 @@
# Exercise 5 Groupby Agg
The goal of this exercise is to learn to compute different types of aggregations on the groups. This small DataFrame contains products and prices.
| | value | product |
|---:|--------:|:-------------|
| 0 | 20.45 | table |
| 1 | 22.89 | chair |
| 2 | 32.12 | chair |
| 3 | 111.22 | mobile phone |
| 4 | 33.22 | table |
| 5 | 100 | mobile phone |
| 6 | 99.99 | table |
1. Compute the min, max and mean price for each product in one single line of code. The expected output is:
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
|:-------------|-------------------:|-------------------:|--------------------:|
| chair | 22.89 | 32.12 | 27.505 |
| mobile phone | 100 | 111.22 | 105.61 |
| table | 20.45 | 99.99 | 51.22 |
Note: The columns don't have to be MultiIndex
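A possible one-liner, assuming the DataFrame above is named `df` (the audit accepts this form):
```python
import pandas as pd

df = pd.DataFrame({'value': [20.45, 22.89, 32.12, 111.22, 33.22, 100, 99.99],
                   'product': ['table', 'chair', 'chair', 'mobile phone',
                               'table', 'mobile phone', 'table']})

# min, max and mean of the price for each product, in one line
print(df.groupby('product').agg({'value': ['min', 'max', 'mean']}))
```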

12
piscine/week01/day04/ex06/audit/readme.md

@ -0,0 +1,12 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is similar to what `unstacked_df.head()` returns:
| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') |
|:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
| 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 |
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
##### The question 2 is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.

32
piscine/week01/day04/ex06/readme.md

@ -0,0 +1,32 @@
# Exercise 6 Unstack
The goal of this exercise is to learn to unstack a MultiIndex.
Let's assume we trained a machine learning model that predicts a daily score for the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, ...
```python
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
#generate tickers
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']
#create indexes
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])
# create DFs
market_data = pd.DataFrame(index=index,
data=np.random.randn(len(index), 1),
columns=['Prediction'])
```
1. Unstack the DataFrame.
The first 3 rows of the DataFrame should look like this:
| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') |
|:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
| 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 |
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
2. Plot the 5 time series in the same plot using Pandas built-in visualization functions with a title.
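A possible solution, reusing the `market_data` built above (the plot line matches the one the audit accepts):
```python
# question 1: move the Ticker level from the index to the columns
unstacked = market_data.unstack()

# question 2: Pandas built-in plotting, one line per ticker, with a title
unstacked.plot(title='Stocks 2021')
```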

40
piscine/week01/day04/readme.md

@ -0,0 +1,40 @@
# W1D04 Piscine AI - Data Science
## Data wrangling with Pandas
Data wrangling is one of the crucial tasks in data science and analysis which includes operations like:
- Data Sorting: To rearrange values in ascending or descending order.
- Data Filtration: To create a subset of available data.
- Data Reduction: To eliminate or replace unwanted values.
- Data Access: To read or write data files.
- Data Processing: To perform aggregation, statistical, and similar operations on specific values.
As explained before, Pandas is an open source library, specifically developed for data science and analysis. It is built upon the NumPy package (to handle numeric data in tabular form) and has inbuilt data structures to ease up the process of data manipulation, aka data munging/wrangling.
## Exercises of the day
- Exercise 1 Concatenate
- Exercise 2 Merge
- Exercise 3 Merge MultiIndex
- Exercise 4 Groupby Apply
- Exercise 5 Groupby Agg
- Exercise 6 Unstack
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe

9
piscine/week01/day05/ex00/audit/readme.md

@ -0,0 +1,9 @@
##### The exercise is validated if all questions of the exercise are validated
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
##### Do `import jupyter`, `import numpy` and `import pandas` run without any error ?

52
piscine/week01/day05/ex00/readme.md

@ -0,0 +1,52 @@
# W1D05 Piscine AI - Data Science
## Time Series with Pandas
Time series data are data that are indexed by a sequence of dates or times. Today, you'll learn how to use methods built into Pandas to work with this index. You'll also learn for instance:
- to resample time series to change the frequency
- to calculate rolling and cumulative values for time series
- to build a backtest
Time series are used A LOT in finance. You'll learn to evaluate financial strategies using Pandas. It is important to keep in mind that Python is vectorized. That's why some questions constrain you not to use a for loop ;-).
## Exercises of the day
- Exercise 1 Series
- Exercise 2 Financial data
- Exercise 3 Multi asset returns
- Exercise 4 Backtest
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`.

35
piscine/week01/day05/ex01/audit/readme.md

@ -0,0 +1,35 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is as below. The best solution uses `pd.date_range` to generate the index and `range` to generate the integer series.
```console
2010-01-01 0
2010-01-02 1
2010-01-03 2
2010-01-04 3
2010-01-05 4
...
2020-12-27 4013
2020-12-28 4014
2020-12-29 4015
2020-12-30 4016
2020-12-31 4017
Freq: D, Name: integer_series, Length: 4018, dtype: int64
```
##### The question 2 is validated if the output is as below. If the `NaN` values have been dropped, the solution is also accepted. The solution uses `rolling().mean()`.
```console
2010-01-01 NaN
2010-01-02 NaN
2010-01-03 NaN
2010-01-04 NaN
2010-01-05 NaN
...
2020-12-27 4010.0
2020-12-28 4011.0
2020-12-29 4012.0
2020-12-30 4013.0
2020-12-31 4014.0
Freq: D, Name: integer_series, Length: 4018, dtype: float64
```

7
piscine/week01/day05/ex01/readme.md

@ -0,0 +1,7 @@
# Exercise 1 Series
The goal of this exercise is to learn to manipulate time series in Pandas.
1. Create a `Series` named `integer_series` from 1st January 2010 to 31 December 2020. Each date is associated with the number of days since 1st January 2010, starting at 0.
2. Using Pandas, compute a 7-day moving average **without a for loop**. This transformation smooths the time series by removing small fluctuations.
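A possible sketch, following the approach the audit recommends:
```python
import pandas as pd

# question 1: pd.date_range for the index, range for the values
dates = pd.date_range('2010-01-01', '2020-12-31', freq='D')
integer_series = pd.Series(range(len(dates)), index=dates, name='integer_series')

# question 2: 7-day moving average, no for loop
smoothed = integer_series.rolling(7).mean()
```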

43
piscine/week01/day05/ex02/audit/readme.md

@ -0,0 +1,43 @@
##### The exercise is validated if all questions of the exercise are validated
###### Have you checked missing values and data types ?
###### Have you converted string dates to datetime ?
###### Have you set dates as index ?
###### Have you used `info` or `describe` to have a first look at the data ?
##### The question 1 is validated if you inserted the right columns in the `Candlestick` `Plotly` object. The Candlestick is based on the Open, High, Low and Close columns. The index is Date (datetime).
##### The question 2 is validated if the output of `print(transformed_df.head().to_markdown())` is as below and if there are **482 months**.
| Date | Open | Close | Volume | High | Low |
|:--------------------|---------:|---------:|------------:|---------:|---------:|
| 1980-12-31 00:00:00 | 0.136075 | 0.135903 | 1.34485e+09 | 0.161272 | 0.112723 |
| 1981-01-30 00:00:00 | 0.141768 | 0.141316 | 6.08989e+08 | 0.155134 | 0.126116 |
| 1981-02-27 00:00:00 | 0.118215 | 0.117892 | 3.21619e+08 | 0.128906 | 0.106027 |
| 1981-03-31 00:00:00 | 0.111328 | 0.110871 | 7.00717e+08 | 0.120536 | 0.09654 |
| 1981-04-30 00:00:00 | 0.121811 | 0.121545 | 5.36928e+08 | 0.131138 | 0.108259 |
To get this result there are two ways: `resample` and `groupby`. There are two key steps:
- Find how to apply the aggregation on the last **business** day of each month. This is already implemented in Pandas and the keyword that should be used either in the `resample` parameter or in `Grouper` is `BM`.
- Choose the right aggregation function for each variable. The prices (Open, Close and Adjusted Close) should be aggregated by taking the `mean`. Low should be aggregated by taking the `minimum` because it represents the lower price of the day, so the lowest price on the month is the lowest price of the lowest prices on the day. The same logic applied to High, leads to use the `maximum` to aggregate the High. Volume should be aggregated using the `sum` because the monthly volume is equal to the sum of daily volume over the month.
##### The question 3 is validated if it doesn't involve a for loop and the output is as below. The first way to compute the return without a for loop is to use `pct_change`. The second way is to implement the formula given in the exercise in a vectorized way. To get the value at `t-1` you can use `shift`.
```console
Date
1980-12-12 NaN
1980-12-15 -0.047823
1980-12-16 -0.073063
1980-12-17 0.019703
1980-12-18 0.028992
...
2021-01-25 0.049824
2021-01-26 0.003704
2021-01-27 -0.001184
2021-01-28 -0.027261
2021-01-29 -0.026448
Name: Open, Length: 10118, dtype: float64
```

10120
piscine/week01/day05/ex02/data/AAPL.csv

File diff suppressed because it is too large diff.load

16
piscine/week01/day05/ex02/readme.md

@ -0,0 +1,16 @@
# Exercise 2 Financial data
The goal of this exercise is to learn to use Pandas on time series and on financial data.
The data we will use is Apple stock.
1. Using `Plotly` plot a Candlestick
2. Aggregate the data to **last business day of each month**. The aggregation should consider the meaning of the variables. How many months are in the considered period ?
3. When comparing many stocks between them the metric which is frequently used is the return of the price. The price is not a convenient metric as the prices evolve in different ranges. The return at time t is defined as
- (Price(t) - Price(t-1))/ Price(t-1)
Using the open price, compute the **daily return**. Propose two different ways **without a for loop**.
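A sketch of the two ways for question 3, assuming the Apple data is loaded with `Date` parsed as datetime and set as the index (as the audit checklist asks):
```python
import pandas as pd

df = pd.read_csv('data/AAPL.csv', parse_dates=['Date'], index_col='Date')

# way 1: built-in percentage change
daily_return = df['Open'].pct_change()

# way 2: the formula, vectorized with shift to access Price(t-1)
daily_return_bis = (df['Open'] - df['Open'].shift(1)) / df['Open'].shift(1)
```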

6
piscine/week01/day05/ex03/audit/readme.md

@ -0,0 +1,6 @@
##### This question is validated if, without having used a for loop, the output DataFrame's shape is `(261, 5)` and your output is the same as the one returned by this line of code. The DataFrame contains random data. Make sure your output and the one returned by this code are based on the same DataFrame.
```python
market_data.loc[market_data.index.get_level_values('Ticker')=='AAPL'].sort_index().pct_change()
```

31
piscine/week01/day05/ex03/readme.md

@ -0,0 +1,31 @@
# Exercise 3 Multi asset returns
The goal of this exercise is to learn to compute daily returns on a DataFrame that contains many assets (multi-assets).
```python
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
#generate tickers
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']
#create indexes
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])
# create DFs
market_data = pd.DataFrame(index=index,
data=np.random.randn(len(index), 1),
columns=['Price'])
```
1. **Without using a for loop**, compute the daily returns (return(d) = (price(d)-price(d-1))/price(d-1)) for all the companies and return a DataFrame as:
| Date | ('Price', 'AAPL') | ('Price', 'AMZN') | ('Price', 'DAI') | ('Price', 'FB') | ('Price', 'GE') |
|:--------------------|--------------------:|--------------------:|-------------------:|------------------:|------------------:|
| 2021-01-01 00:00:00 | nan | nan | nan | nan | nan |
| 2021-01-04 00:00:00 | 1.01793 | 0.0512955 | 3.84709 | -0.503488 | 0.33529 |
| 2021-01-05 00:00:00 | -0.222884 | -1.64623 | -0.71817 | -5.5036 | -4.15882 |
Note: The data is generated randomly, so you may have different values. But this shows the expected DataFrame structure.
*Hint*: use `groupby`.
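One possible way, reusing `market_data` from above; unstacking first gives the expected column structure, while the hinted `groupby` route gives the same values:
```python
# unstack the tickers into columns, then day-over-day change per column
returns = market_data.unstack().pct_change()

# alternative following the hint: per-ticker pct_change, then unstack
returns_bis = market_data.groupby(level='Ticker')['Price'].pct_change().unstack()
```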

62
piscine/week01/day05/ex04/audit/readme.md

@ -0,0 +1,62 @@
##### The exercise is validated if all questions of the exercise are validated
###### Have you checked missing values and data types ?
###### Have you converted string dates to datetime ?
###### Have you set dates as index ?
###### Have you used `info` or `describe` to have a first look at the data ?
**My results can be reproduced using: `np.random.seed(2712)`. Given the versions of NumPy used I do not guarantee the reproducibility of the results - that is why I also explain the steps to get to the solution.**
##### The question 1 is validated if the return is computed as: Return(t) = (Price(t+1) - Price(t))/Price(t) and returns this output. Note that if the index is not sorted in ascending order the future return computed is wrong. The answer is also accepted if the return is computed as in exercise 2 and then shifted into the future using `shift`, but I do not recommend this implementation as it adds missing values !
```console
Date
1980-12-12 -0.052170
1980-12-15 -0.073403
1980-12-16 0.024750
1980-12-17 0.029000
1980-12-18 0.061024
...
2021-01-25 0.001679
2021-01-26 -0.007684
2021-01-27 -0.034985
2021-01-28 -0.037421
2021-01-29 NaN
Name: Daily_futur_returns, Length: 10118, dtype: float64
```
An example of solution is:
```python
def compute_futur_return(price):
return (price.shift(-1) - price)/price
compute_futur_return(df['Adj Close'])
```
##### The question 2 is validated if the index of the Series is the same as the index of the DataFrame. The data of the series can be generated using `np.random.randint(0, 2, len(df.index))`.
##### The question 3 is validated if the PnL is computed as: signal * future_return. Both series should have the same index.
```console
Date
1980-12-12 -0.052170
1980-12-15 -0.073403
1980-12-16 0.024750
1980-12-17 0.029000
1980-12-18 0.061024
...
2021-01-25 0.001679
2021-01-26 -0.007684
2021-01-27 -0.034985
2021-01-28 -0.037421
2021-01-29 NaN
Name: PnL, Length: 10119, dtype: float64
```
##### The question 4 is validated if you computed the return of the strategy as: `(Total earned - Total invested) / Total invested`. The result should be close to 0. The formula given could be simplified as `PnLs.sum() / signal.sum()`. My return is: 0.00043546984088551553 because I invested 5147$ and I earned 5149$.
##### The question 5 is validated if you replaced the previous signal Series with 1s. Similarly to the previous question, we earned 10128$ and we invested 10118$ which leads to a return of 0.00112670194140969 (0.1%).

44
piscine/week01/day05/ex04/readme.md

@ -0,0 +1,44 @@
# Exercise 4 Backtest
The goal of this exercise is to learn to perform a backtest in Pandas. A backtest is a tool that allows you to know how a strategy would have performed retrospectively using historical data. In this exercise we will focus on the backtesting tool and not on how to build the best strategy.
We will backtest a **long only** strategy on Apple Inc. Long only means that we only consider buying the stock. The input signal at date d says if the close price will increase at d+1. We assume that the input signal is available before the market closes.
1. Drop the rows with missing values and compute the daily future return on the Apple stock on the adjusted close price. The daily future return means: **Return(t) = (Price(t+1) - Price(t))/Price(t)**.
There are some events such as splits or dividends that artificially change the price of the stock. That is why the close price is adjusted, to avoid outliers in the price data.
2. Create a Series that contains a random boolean array with **p=0.5**
```console
Here an example of the expected time series
2010-01-01 1
2010-01-02 0
2010-01-03 0
2010-01-04 1
2010-01-05 0
Freq: D, Name: long_only_signal, dtype: int64
```
- The information in this series should be interpreted this way:
- On the 2010-01-01 I receive `1` before the market closes meaning that, if I trust the signal, the close price of day d+1 will increase. I should buy the stock before the market closes.
- On the 2010-01-02 I receive `0` before the market closes meaning that, if I trust the signal, the close price of day d+1 will not increase. I should not buy the stock.
3. Backtest the signal created in Question 2. Here are some assumptions made to backtest this signal:
- When, at date d, the signal equals 1 we buy 1$ of stock just before the market closes and we sell the stock just before the market closes the next day.
- When, at date d, the signal equals 0, we do not buy anything.
- The profit is not reinvested, when invested, the amount is always 1$.
- Fees are not considered
**The expected output** is a **Series that gives for each day the return of the strategy. The return of the strategy is the PnL (Profit and Losses) divided by the invested amount**. The PnL for day d is:
`(money earned this day - money invested this day)`
Let's take the example of a 20% return for an invested amount of 1$. The PnL is `(1.2 - 1) = 0.2`. We notice that the PnL when the signal is 1 equals the daily return. The PnL when the signal is 0 is 0.
By convention, we consider that the PnL of d is affected to day d and not d+1, even if the underlying return contains the information of d+1.
**The usage of for loop is not allowed**.
4. Compute the return of the strategy. The return of the strategy is defined as: `(Total earned - Total invested) / Total invested`
5. Now the input signal is: **always buy**. Compute the daily PnL and the total PnL. Plot the daily PnL of Q5 and of Q3 on the same plot
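A sketch of the backtest, assembled from the pieces the audit accepts; the loading step assumes the `Date` column is parsed and set as the index:
```python
import numpy as np
import pandas as pd

df = pd.read_csv('data/AAPL.csv', parse_dates=['Date'], index_col='Date').dropna()

# Q1: daily future return on the adjusted close price
future_return = (df['Adj Close'].shift(-1) - df['Adj Close']) / df['Adj Close']

# Q2: random long-only signal with p=0.5
signal = pd.Series(np.random.randint(0, 2, len(df.index)),
                   index=df.index, name='long_only_signal')

# Q3: PnL per day, no for loop
pnl = signal * future_return

# Q4: return of the strategy = (total earned - total invested) / total invested
strategy_return = pnl.sum() / signal.sum()

# Q5: always-buy signal
pnl_always = pd.Series(1, index=df.index) * future_return
```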
- https://www.investopedia.com/terms/b/backtesting.asp

37
piscine/week01/day05/readme.md

@ -0,0 +1,37 @@
# W1D05 Piscine AI - Data Science
## Time Series with Pandas
Time series data are data that are indexed by a sequence of dates or times. Today, you'll learn how to use methods built into Pandas to work with this index. You'll also learn for instance:
- to resample time series to change the frequency
- to calculate rolling and cumulative values for time series
- to build a backtest
Time series are used A LOT in finance. You'll learn to evaluate financial strategies using Pandas. It is important to keep in mind that Python is vectorized. That's why some questions constrain you not to use a for loop ;-).
## Exercises of the day
- Exercise 1 Series
- Exercise 2 Financial data
- Exercise 3 Multi asset returns
- Exercise 4 Backtest
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe

105
piscine/week01/raid01/audit/readme.md

@ -0,0 +1,105 @@
# RAID01 - Backtesting on the SP500 - audit
### Preliminary
###### Is the structure of the project as below?
```
project
│ README.md
│ environment.yml
│
└───data
│ │ sp500.csv
│ | prices.csv
│
└───notebook
│ │ analysis.ipynb
|
|───scripts
| │ memory_reducer.py
| │ preprocessing.py
| │ create_signal.py
| | backtester.py
│ | main.py
│
└───results
│ plots
│ results.txt
│ outliers.txt
```
###### Does the readme file contain a description of the project, explain how to run the code from an empty environment, give a summary of the implementation of each Python file, and contain a conclusion that gives the performance of the strategy?
###### Does the environment file contain all the libraries used, with the versions necessary to run the code?
###### Does the notebook contain a missing values analysis? **Example**: number of missing values per variable or per year.
###### Does the notebook contain an outliers analysis?
###### Does the notebook contain a histogram of the average price per company, for all variables (the plot should be saved with the images)? This is required only for the `prices.csv` data.
###### Does the notebook describe at least 5 outliers ('ticker', 'date', 'price')? Checking the outliers is simple: search the historical stock price on Google at the given date and compare. The price may fluctuate a bit. The goal here is not to match the historical price found on Google but to detect a huge difference between the price in our data and the real historical one.
Notes:
- For all questions, always check that the values are sorted by date. If not, the answers are wrong.
- The plots are validated only if they contain a title.
## Python files
### 1. memory_reducer.py
###### Does the `prices` data set weigh less than **8MB** (megabytes)?
###### Does the `sp500` data set weigh less than **0.15MB** (megabytes)?
###### Is the float data type no smaller than `np.float32`? A smaller data type may alter the precision of the data.
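One way to check these sizes (a sketch, assuming `prices` is the DataFrame returned by `memory_reducer`):
```python
# Memory footprint in MB, including object columns
print(prices.memory_usage(deep=True).sum() / 1e6)
print(prices.dtypes)  # no float column should be smaller than float32
```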
### 2. preprocessing.py
##### The data is aggregated on a monthly period and only the last element is kept.
##### The outliers are filtered out by removing all prices greater than 10k$ or smaller than 0.1$.
##### The historical return is computed using only current and past values.
##### The future return is computed using only current and future values. (Reminder: as the data is resampled monthly, computing the return is straightforward.)
##### The outliers in the returns data are set to NaN for all returns outside the years 2008 and 2009. The filters are: return > 1 or return < -0.5.
##### The missing values are filled using the last value available **for the company** (see the sketch below). `df.fillna(method='ffill')` is wrong because the previous value can be the return or price of another company.
##### The missing values that can't be filled using a previous existing value are dropped.
##### The number of missing values is 0.
Best practice:
Do not fill the last values of the future return: they are missing because the data set ends at a given date, so forward-filling them doesn't make sense. It makes more sense to drop these rows because the backtest focuses on observed data.
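A minimal sketch of a per-company forward fill, assuming a `ticker` column and a `price` column (the names are illustrative):
```python
# Forward-fill within each company, never across companies
prices['price'] = prices.groupby('ticker')['price'].ffill()
# Drop the rows that could not be filled (no previous value for the company)
prices = prices.dropna()
```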
### 3. create_signal.py
##### The metric `average_return_1y` is added as a new column of the merged DataFrame. The metric is relative to a company: it is important to group the data by company before computing the average return over one year. It is accepted to consider that one year is 12 consecutive rows.
##### The signal is added as a new column to the merged DataFrame. The signal, which is boolean, indicates whether, within the same month, the company is in the top 20, i.e. among the 20 companies with the highest metric for that month. The highest metric gets rank 1 (if `rank` is used, the parameter `ascending` should be set to `False`). A sketch is given below.
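A sketch of the ranking step, assuming the merged DataFrame has `month` and `average_return_1y` columns (the names are illustrative):
```python
# Rank the metric within each month; the highest metric gets rank 1
ranks = df.groupby('month')['average_return_1y'].rank(ascending=False)
df['signal'] = ranks <= 20
```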
### 4. backtester.py
##### The PnL is computed by multiplying the signal `Series` by the **future returns**.
##### The return of the strategy is computed by dividing the PnL by the sum of the signal `Series`.
##### The signal used on the SP500 is `pd.Series([20,20,...,20])`.
##### The series used in the plot are the cumulative PnL; `cumsum` can be used.
##### The PnL on the full historical data is **smaller than 75$**. If not, it means that the outliers were not corrected properly.
###### Does the plot contain a title ?
###### Does the plot contain a legend ?
###### Does the plot contain x-axis and y-axis labels?
![alt text][performance]
[performance]: ../images/w1_weekend_plot_pnl.png "Cumulative Performance"
### 5. main.py
###### Does the command `python main.py` execute the code from data imports to the backtest and save the results? It shouldn't return any error for the project to be validated.

9032
piscine/week01/raid01/data/fundamentals.csv

File diff suppressed because it is too large

3771
piscine/week01/raid01/data/sp500.csv

File diff suppressed because it is too large

3645
piscine/week01/raid01/data/stock_prices.csv

File diff suppressed because one or more lines are too long

BIN
piscine/week01/raid01/images/w1_weekend_plot_pnl.png

Binary file not shown


155
piscine/week01/raid01/readme.md

@ -0,0 +1,155 @@
# RAID01 - Backtesting on the SP500
## SP500 data preprocessing
The goal of this project is to perform a backtest on the SP500 constituents. The SP500 is an index of the 500 largest capitalizations in the US.
## Data
The input files are:
- `sp500.csv` contains the SP500 data. The SP500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States.
- `stock_prices.csv`: contains the close prices of all the companies that have ever been in the SP500. It contains a lot of missing data. The adjusted close price may be unavailable for three main reasons:
- The company doesn't exist at date d
- The company is not public (not listed)
- Its close price hasn't been reported
- Note: The quality of this data set is not good: some prices are wrong, there are price spikes, and there are price adjustments (share splits, dividend distributions). The price adjustments are corrected in the adjusted close, but I'm not providing that data for this project, to let you understand what bad-quality data is and how important it is to detect outliers and missing values. The idea is not to correct the full data set manually, but to correct the main problems.
_Note: The corrections will not fully fix the data; as a result, the results may be abnormal compared to results from cleaned financial data. That's not a problem for this small project!_
## Problem
Once this data is preprocessed, it will be used to generate a signal, that is, for each asset at each date, a metric that indicates whether the asset price will increase the next month. At each date (once a month) we will take the 20 highest metrics and invest 1$ per company. This strategy is called **stock picking**. It consists in picking stocks in an index and trying to outperform the index. Finally, we will compare the performance of our strategy to its benchmark: the SP500.
It is important to understand that the SP500 components change over time. The reason is simple: Facebook entered the SP500 in 2013, which means another company had to be removed from the 500.
The structure of the project is:
```console
project
│ README.md
│ environment.yml
│
└───data
│ │ sp500.csv
│ | prices.csv
│
└───notebook
│ │ analysis.ipynb
|
|───scripts
| │ memory_reducer.py
| │ preprocessing.py
| │ create_signal.py
| | backtester.py
│ | main.py
│
└───results
│ plots
│ results.txt
│ outliers.txt
```
There are five parts:
## 1. Preliminary
- Create a function that takes as input one CSV data file. This function should optimize the types to reduce the size and return a memory-optimized DataFrame.
- For `float` data, the smallest data type used is `np.float32`.
- These steps may help you to implement the memory_reducer (a sketch is given after this list):
1. Iterate over every column
2. Determine if the column is numeric
3. Determine if the column can be represented by an integer
4. Find the min and the max value
5. Determine and apply the smallest datatype that can fit the range of values
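A minimal sketch of such a function, under the assumption that downcasting integers with `pd.to_numeric` and floats to `np.float32` is acceptable:
```python
import numpy as np
import pandas as pd

def memory_reducer(path):
    """Load a CSV and downcast the numeric columns to smaller types."""
    df = pd.read_csv(path)
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            # let pandas pick the smallest integer type that fits the range
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            # np.float32 is the smallest float type allowed here
            df[col] = df[col].astype(np.float32)
    return df
```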
## 2. Data wrangling and preprocessing
- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least:
- Missing values analysis
- Outliers analysis (there are a lot of outliers)
- A histogram of the average price per company, for all variables (save the plot with the images).
- Describe at least 5 outliers ('ticker', 'date', 'price'). Put them, with the 3 fields, in the `outliers.txt` file in the `results` folder.
_Note: create functions that generate the plots and save them in the images folder. Add a parameter `plot` with a default value `False`, so that by default the function doesn't display the plot. This will be useful during the correction, to let people run your code without overriding your plots._
- Here is how the `prices` data should be preprocessed (a sketch is given at the end of this section):
- Resample data on month and keep the last value
- Filter prices outliers: Remove prices outside of the range 0.1$, 10k$
- Compute monthly returns:
- Historical returns: **returns(current month) = (price(current month) - price(previous month)) / price(previous month)**
- Future returns: **returns(current month) = (price(next month) - price(current month)) / price(current month)**
- Replace the return outliers by the last value available for the company. This corrects price spikes that correspond to a monthly return greater than 1 or smaller than -0.5. This correction should not be applied to the 2008 and 2009 period, as the financial crisis impacted the market brutally. **Don't forget that a value is considered an outlier relative to the other returns/prices of the same company.**
At this stage the DataFrame should look like this:
| | Price | monthly_past_return | monthly_future_return |
| :--------------------------------------------------- | ------: | ------------------: | -------------------: |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'A') | 36.7304 | nan | -0.00365297 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AA') | 25.9505 | nan | 0.101194 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AAPL') | 1.00646 | nan | 0.452957 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABC') | 11.4383 | nan | -0.0528713 |
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABT') | 38.7945 | nan | -0.07205 |
- Fill the missing values using the last available value (same company)
- Drop the missing values that can't be filled
- Print `prices.isna().sum()`
- Here is how the `sp500.csv` data should be preprocessed:
- Resample data on month and keep the last value
- Compute historical monthly returns on the adjusted close
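A minimal sketch of the `prices` preprocessing, assuming a DataFrame indexed by date with `ticker` and `price` columns (the names are illustrative):
```python
# Monthly resample per company, keeping the last price of each month
monthly = prices.groupby('ticker')['price'].resample('M').last()

# Filter the price outliers
monthly = monthly[(monthly > 0.1) & (monthly < 10_000)]

g = monthly.groupby(level='ticker')
monthly_past_return = g.pct_change()               # current and past values only
monthly_future_return = g.shift(-1) / monthly - 1  # current and future values only
```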
## 3. Create signal
At this stage we have a data set with features that we will leverage to build an investment signal. As previously said, we will focus on one single variable to create the signal: **monthly_past_return**. The signal will be the average of the monthly returns over the previous year.
The naive assumption made here is that if a stock has performed well over the last year, it will perform well the next month. Moreover, we assume that we can buy stocks as soon as we have the signal (the signal is available at the close of day `d` and we assume that we can buy the stock at the close of day `d`. This assumption is acceptable when considering monthly returns, because the difference between the close of day `d` and the open of day `d+1` is small compared to the monthly return).
- Create a column `average_return_1y`.
- Create a column named `signal` that contains `True` if `average_return_1y` is among the 20 highest values of `average_return_1y` within the month.
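A sketch of the signal creation, assuming the DataFrame has `ticker` and `month` columns and a `monthly_past_return` column (the names are illustrative):
```python
# Average of the monthly returns over the previous year, per company
df['average_return_1y'] = (
    df.groupby('ticker')['monthly_past_return']
      .transform(lambda x: x.rolling(12).mean())
)

# True for the 20 highest metrics within each month
df['signal'] = (
    df.groupby('month')['average_return_1y']
      .rank(ascending=False, method='first') <= 20
)
```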
## 4. Backtester
At this stage we have an investment signal that indicates, each month, the 20 companies we should invest 1$ in (1$ each). In order to check the strategy and its performance, we will backtest our investment signal.
- Compute the PnL and the total return of our strategy, without a for loop. Save the results in a text file `results.txt` in the folder `results`.
- Compute the PnL and the total return of the strategy that consists in investing 20$ in the SP500 at each date (the data is monthly). Compare the two. Save the results in the same text file `results.txt` in the folder `results`.
- Create a plot that shows the performance of the strategy over time for the SP500 and the Stock Picking 20 strategy.
A data point (x-axis: date; y-axis: cumulated_return) is the **cumulated return** of the strategy from its beginning until date `t`. Save the plot in the results folder.
> This plot is used a lot in finance because it helps to compare a custom strategy with an index. In that case we say that the SP500 is used as a **benchmark** for the Stock Picking strategy.
![alt text][performance]
[performance]: images/w1_weekend_plot_pnl.png 'Cumulative Performance'
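A sketch of the backtest and plot, assuming `prices` has a (date, ticker) MultiIndex with `signal` and `monthly_future_return` columns, and `sp500_return` is the monthly return Series of the index (the names are illustrative):
```python
import matplotlib.pyplot as plt

# PnL per date: 1$ on each of the 20 picked stocks
pnl = (prices['signal'] * prices['monthly_future_return']).groupby(level=0).sum()
# Benchmark: 20$ on the SP500 each month
benchmark_pnl = 20 * sp500_return

plt.plot(pnl.cumsum(), label='Stock Picking 20')
plt.plot(benchmark_pnl.cumsum(), label='SP500 benchmark')
plt.title('Cumulative PnL')
plt.xlabel('Date')
plt.ylabel('Cumulated PnL ($)')
plt.legend()
plt.savefig('results/w1_weekend_plot_pnl.png')
```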
## 5. Main
Here is a sketch of `main.py`.
```python
# main.py
# The imports below assume each script exposes a function of the same name.
from memory_reducer import memory_reducer
from preprocessing import preprocessing
from create_signal import create_signal
from backtester import backtest

# paths to the data files (adjust to your layout)
paths = ['data/prices.csv', 'data/sp500.csv']

# import data
prices, sp500 = memory_reducer(paths)
# preprocessing
prices, sp500 = preprocessing(prices, sp500)
# create signal
prices = create_signal(prices)
# backtest
backtest(prices, sp500)
```
**The command `python main.py` executes the code from data imports to the backtest and save the results.**

9
piscine/week02/day01/ex00/audit/readme.md

@ -0,0 +1,9 @@
##### The exercise is validated if all questions of the exercise are validated.
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x`? x >= 8
##### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error ?

81
piscine/week02/day01/ex00/readme.md

@ -0,0 +1,81 @@
# W2D01 Piscine AI - Data Science
![Alt Text](w2_day01_linear_regression_video.gif)
## Linear regression with Scikit Learn
The goal of this day is to understand practical Linear regression and supervised learning.
The word "regression" was introduced by Sir Francis Galton (a cousin of C. Darwin) when he
studied the size of individuals within a progeny. He was trying to understand why
large individuals in a population appeared to have smaller children, more
close to the average population size; hence the introduction of the term "regression".
Today we will learn a basic algorithm used in **supervised learning** : **The Linear Regression**. We will be using **Scikit-learn** which is a machine learning library. It is designed to interoperate with the Python libraries NumPy and Pandas.
We will also learn progressively the Machine Learning methodology for supervised learning - today we will focus on evaluating a machine learning model by splitting the data set in a train set and a test set.
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Scikit-learn estimator
- Exercise 2 Linear regression in 1D
- Exercise 3 Train test split
- Exercise 4 Forecast diabetes progression
- Bonus: Exercise 5 Gradient Descent - **Optional**
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
*Version of Scikit Learn I used to do the exercises: 0.22*. I suggest using the most recent one. Scikit Learn 1.0 is finally available, after ... 14 years.
## Resources
### To start with Scikit-learn
- https://scikit-learn.org/stable/tutorial/basic/tutorial.html
- https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html
- https://scikit-learn.org/stable/modules/linear_model.html
### Machine learning methodology and algorithms
- This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Andrew Ng is a star of the Machine Learning community. I recommend spending some time during the projects to focus on some of the algorithms. However, Python is not the language used in the course. https://www.coursera.org/learn/machine-learning
- https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet
- https://scikit-learn.org/stable/tutorial/index.html
### Linear Regression
- https://towardsdatascience.com/laymans-introduction-to-linear-regression-8b334a3dab09
- https://towardsdatascience.com/linear-regression-the-actually-complete-introduction-67152323fcf2
### Train test split
- https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
- https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture?hl=en
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment tool you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
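For instance, with `conda` (one possible way; `virtualenv` + `pip` works too):
```console
conda create -n ex00 python=3.8 pandas numpy jupyter matplotlib scikit-learn
conda activate ex00
python --version
```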

22
piscine/week02/day01/ex01/audit/readme.md

@ -0,0 +1,22 @@
##### The exercise is validated if all questions of the exercise are validated.
##### The question 1 is validated if the output of the fitted model is:
```python
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                 normalize=False)
```
##### The question 2 is validated if the output is:
```python
array([[3.96013289]])
```
##### The question 3 is validated if the output is:
```console
Coefficients: [[0.99667774]]
Intercept: [-0.02657807]
Score: 0.9966777408637874
```

13
piscine/week02/day01/ex01/readme.md

@ -0,0 +1,13 @@
# Exercise 1 Scikit-learn estimator
The goal of this exercise is to learn to fit a Scikit-learn estimator and use it to predict.
```python
X, y = [[1],[2.1],[3]], [[1],[2],[3]]
```
1. Fit a LinearRegression from Scikit-learn with X as the features and y as the target.
2. Predict for `x_pred = [[4]]`.
3. Print the coefficients (`coef_`), the intercept (`intercept_`) and the score (`score`) of the regression of X and y.
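A minimal sketch of the three questions:
```python
from sklearn.linear_model import LinearRegression

X, y = [[1], [2.1], [3]], [[1], [2], [3]]

model = LinearRegression()
model.fit(X, y)                      # question 1

print(model.predict([[4]]))          # question 2
print('Coefficients:', model.coef_)  # question 3
print('Intercept:', model.intercept_)
print('Score:', model.score(X, y))
```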

Some files were not shown because too many files changed in this diff
