##### The exercise is validated if all the questions of the exercise are validated.

##### Activate the virtual environment. If you used `conda`, run `conda activate ex00`.

###### Does the shell display the name `ex00` of the environment on the left?

##### Run `python --version`.

###### Does it print `Python 3.8.x`? x can be any number from 0 to 9.

##### Do `import jupyter` and `import numpy` run without any error?

###### Have you used the command `jupyter notebook --port 8891`?

###### Is there a file named `Notebook_ex00.ipynb` in the working directory?

###### Is the following markdown code executed in a markdown cell in the first cell?

```
# H1 TITLE
## H2 TITLE
```

###### Does the second cell contain `print("Buy the dip ?")` and return `Buy the dip ?` in the output section?
# W1D01 Piscine AI - Data Science

## NumPy

The goal of this day is to understand the practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.

## Exercises of the day

- Exercise 0 Environment and libraries
- Exercise 1 Your first NumPy array
- Exercise 2 Zeros
- Exercise 3 Slicing
- Exercise 4 Random
- Exercise 5 Split, concatenate, reshape arrays
- Exercise 6 Broadcasting and Slicing
- Exercise 7 NaN
- Exercise 8 Wine
- Exercise 9 Football tournament

## Virtual Environment

- Python 3.x
- NumPy
- Jupyter or JupyterLab

*Version of NumPy I used to do the exercises: 1.18.1*.
I suggest using the most recent one.

## Resources

- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/

# Exercise 0 Environment and libraries

The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow you to write and test code within seconds. However, it is really easy to implement unstable and non-reproducible code using notebooks. Keep the notebook and the underlying code clean. An article in the resources below details when a notebook should be used. Notebooks can be used for most of the exercises of the piscine, as the goal is to experiment A LOT. But no worries, you'll be asked to build a more robust structure for all the projects.

**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.

I recommend using:

- the **last stable version** of Python. However, for educational purposes you will install a specific version of Python in this exercise.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.

1. Create a virtual environment named `ex00`, with Python `3.8` and the following libraries: `numpy` and `jupyter`.

2. Launch a `jupyter notebook` on port `8891` and create a notebook named `Notebook_ex00`. `JupyterLab` can be used instead of Jupyter Notebook here.

3. Put the text `H1 TITLE` as **heading level 1** and `H2 TITLE` as **heading level 2** in the first cell.

4. Run `print("Buy the dip ?")` in the second cell.

## Resources

- https://www.python.org/
- https://docs.conda.io/
- https://jupyter.org/
- https://numpy.org/
- https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330
- https://odsc.medium.com/why-you-should-be-using-jupyter-notebooks-ea2e568c59f2
##### This exercise is validated if `your_np_array` is a NumPy array. It can be checked with `type(your_np_array)`, which should be equal to `numpy.ndarray`, and if the types of its elements are as follows.

##### Try and run the following code.

```python
for i in your_np_array:
    print(type(i))
```

```console
<class 'int'>
<class 'float'>
<class 'str'>
<class 'dict'>
<class 'list'>
<class 'tuple'>
<class 'set'>
<class 'bool'>
```

###### Does it display the right types as above?
# Exercise 1 Your first NumPy array

The goal of this exercise is to use many Python data types in **NumPy** arrays. **NumPy** arrays are intensively used in **NumPy** and **Pandas**. They are flexible and allow the use of optimized **NumPy** underlying functions.

1. Create a NumPy array that contains: an integer, a float, a string, a dictionary, a list, a tuple, a set and a boolean.

The expected output is:

```python
for i in your_np_array:
    print(type(i))
```

```console
<class 'int'>
<class 'float'>
<class 'str'>
<class 'dict'>
<class 'list'>
<class 'tuple'>
<class 'set'>
<class 'bool'>
```
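A minimal sketch of one way to build such an array (the values and the variable name `your_np_array` are illustrative; filling an empty `object` array element by element avoids NumPy trying to interpret the nested sequences):

```python
import numpy as np

# Build a 1-D object array and fill it element by element, so NumPy
# stores each Python object as-is instead of trying to broadcast it.
values = [1, 2.5, "hello", {"a": 1}, [1, 2], (3, 4), {5, 6}, True]
your_np_array = np.empty(len(values), dtype=object)
for idx, value in enumerate(values):
    your_np_array[idx] = value

for i in your_np_array:
    print(type(i))
```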
##### The exercise is validated if all the questions of the exercise are validated.

##### The question 1 is validated if the solution uses `np.zeros` and if the shape of the array is `(300,)`.

##### The question 2 is validated if the solution uses `reshape` and the shape of the array is `(3, 100)`.
# Exercise 2 Zeros

The goal of this exercise is to learn to create a NumPy array with 0s.

1. Create a NumPy array of dimension **300** with zeros, without filling it manually
2. Reshape it to **(3, 100)**
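The two questions can be sketched as follows (variable names are illustrative):

```python
import numpy as np

# Question 1: a 1-D array of 300 zeros, shape (300,)
zeros = np.zeros(300)

# Question 2: reshape it into 3 rows of 100 values, shape (3, 100)
reshaped = zeros.reshape(3, 100)
```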
##### The exercise is validated if all the questions of the exercise are validated.

##### The question 1 is validated if the solution doesn't involve a for loop or writing all the integers from 1 to 100, and if the array is `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`.

##### The question 2 is validated if the solution is: `integers[::2]`

##### The question 3 is validated if the solution is: `integers[::-2]`

##### The question 4 is validated if the array is `np.array([1,0,3,4,0,...,0,99,100])`. There are at least two ways to get this result without a for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array:

```python
mask = (integers + 1) % 3 == 0
integers[mask] = 0
```
# Exercise 3 Slicing

The goal of this exercise is to learn NumPy indexing/slicing. It allows you to access values of a NumPy array efficiently and without a for loop.

1. Create a NumPy array of dimension 1 that contains all the integers from 1 to 100, ordered.
2. Without using a for loop and using the array created in Q1, create an array that contains all the odd integers. The expected output is: `np.array([1,3,...,99])`. *Hint*: it takes one line
3. Without using a for loop and using the array created in Q1, create an array that contains all the even integers reversed. The expected output is: `np.array([100,98,...,2])`. *Hint*: it takes one line

4. Using the array of Q1, set the value of every third element (starting with the second) to 0. The expected output is: `np.array([1,0,3,4,0,...,0,99,100])`
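A sketch of the four steps, assuming the Q1 array is named `integers` (Q4 works on a copy so the earlier answers are left untouched):

```python
import numpy as np

# Q1: integers from 1 to 100 (np.arange's stop bound is exclusive)
integers = np.arange(1, 101)

# Q2: odd integers - every second element starting from index 0
odds = integers[::2]

# Q3: even integers reversed - step backwards by 2 from the last element
evens_reversed = integers[::-2]

# Q4: every third element, starting with the second one, set to 0
every_third_zeroed = integers.copy()
every_third_zeroed[1::3] = 0
```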
##### The exercise is validated if all the questions of the exercise are validated.

##### For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.

##### The question 1 is validated if the solution is: `np.random.seed(888)`

##### The question 2 is validated if the solution is: `np.random.randn(100)`. The value of the first element is `0.17620087373662233`.

##### The question 3 is validated if the solution is: `np.random.randint(1,11,(8,8))`. Given the NumPy version and the seed, you should have this output:

```console
array([[ 7,  4,  8, 10,  2,  1,  1, 10],
       [ 4,  1,  7,  4,  3,  5,  2,  8],
       [ 3,  9,  7,  4,  9,  6, 10,  5],
       [ 7, 10,  3, 10,  2,  1,  3,  7],
       [ 3,  2,  3,  2, 10,  9,  5,  4],
       [ 4,  1,  9,  7,  1,  4,  3,  5],
       [ 3,  2, 10,  8,  6,  3,  9,  4],
       [ 4,  4,  9,  2,  8,  5,  9,  5]])
```

##### The question 4 is validated if the solution is: `np.random.randint(1,18,(4,2,5))`. Given the NumPy version and the seed, you should have this output:

```console
array([[[14, 16,  8, 15, 14],
        [17, 13,  1,  4, 17]],

       [[ 7, 15,  2,  8,  3],
        [ 9,  4, 13,  9, 15]],

       [[ 5, 11, 11, 14, 10],
        [ 2,  1, 15,  3,  3]],

       [[ 3, 10,  5, 16, 13],
        [17, 12,  9,  7, 16]]])
```
# Exercise 4 Random

The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons:
lack of real data, creating a random benchmark, using varied data sets.
NumPy proposes a lot of options to generate random data. In statistics, assumptions are made on the distribution the data comes from. All the data distributions that can be generated randomly are described in the documentation. In this exercise we will focus on two distributions:

- Uniform: for example, if your goal is to generate a random number from 1 to 100 such that every number has the same probability, you'll need the uniform distribution. NumPy provides `randint` and `uniform` to generate from uniform distributions.

- Normal: the normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls**, it can be done using the normal distribution. In that case, we need two parameters: the mean (1.51 m) and the standard deviation (0.0741 m). NumPy provides `randn` to generate from the standard normal distribution (among others).

https://numpy.org/doc/stable/reference/random/generator.html

1. Set the seed to 888
2. Generate a **one-dimensional** array of size 100 with a normal distribution
3. Generate a **two-dimensional** array of size 8,8 with random integers from 1 to 10 - both included (same probability for each integer)
4. Generate a **three-dimensional** array of size 4,2,5 with random integers from 1 to 17 - both included (same probability for each integer)
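The four steps can be sketched as follows (note that the upper bound of `np.random.randint` is exclusive, so drawing integers from 1 to 10 inclusive requires passing 11; variable names are illustrative):

```python
import numpy as np

# Q1: fix the seed so the "random" draws are reproducible
np.random.seed(888)

# Q2: 100 draws from the standard normal distribution
normal_sample = np.random.randn(100)

# Q3: 8x8 integers uniformly drawn from 1..10 (the high bound 11 is exclusive)
grid = np.random.randint(1, 11, (8, 8))

# Q4: 4x2x5 integers uniformly drawn from 1..17
cube = np.random.randint(1, 18, (4, 2, 5))
```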
##### The exercise is validated if all the questions of the exercise are validated.

##### The question 1 is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 50 is part of the array.

##### The question 2 is validated if the generated array is based on an iterator such as `range` or `np.arange`. Check that 100 is part of the array.

##### The question 3 is validated if you concatenated this way: `np.concatenate((array1, array2))`.

##### The question 4 is validated if the result is:

```console
array([[  1, ... ,  10],
       ...
       [ 91, ... , 100]])
```

The easiest way is to use `array.reshape(10,10)`.

https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of NumPy Arrays)
# Exercise 5: Split, concatenate, reshape arrays

The goal of this exercise is to learn to concatenate and reshape arrays.

1. Generate an array with integers from 1 to 50: `array([1,...,50])`

2. Generate an array with integers from 51 to 100: `array([51,...,100])`

3. Using `np.concatenate`, concatenate the two arrays into: `array([1,...,100])`

4. Reshape the previous array into:

```console
array([[  1, ... ,  10],
       ...
       [ 91, ... , 100]])
```
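A sketch of the four steps (`np.concatenate` takes the arrays to join as a single tuple; variable names are illustrative):

```python
import numpy as np

# Q1 and Q2: two consecutive ranges (np.arange's stop bound is exclusive)
first_half = np.arange(1, 51)     # array([1, ..., 50])
second_half = np.arange(51, 101)  # array([51, ..., 100])

# Q3: join them into array([1, ..., 100])
full = np.concatenate((first_half, second_half))

# Q4: reshape into a 10x10 grid, filled row by row
grid = full.reshape(10, 10)
```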
##### The exercise is validated if all the questions of the exercise are validated.

##### The question 1 is validated if the output is the same as `np.ones([9,9], dtype=np.int8)`.

##### The question 2 is validated if the output is:

```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```

##### The solution of question 2 is not accepted if the values of the array have been changed one by one manually. The usage of a for loop is not allowed either.
Here is an example of a possible solution:

```python
x[1:8, 1:8] = 0
x[2:7, 2:7] = 1
x[3:6, 3:6] = 0
x[4, 4] = 1
```
# Exercise 6: Broadcasting and Slicing

The goal of this exercise is to learn to access values of n-dimensional arrays efficiently.

1. Create a 2-dimensional array of size 9,9 filled with 1s. Each value has to be an `int8`.
2. Using **slicing**, output this array:

```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 1, 1, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```

https://jakevdp.github.io/PythonDataScienceHandbook/ (section: Computation on Arrays: Broadcasting)
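One possible sketch (the variable name `x` is illustrative): it paints concentric square rings by assigning constants to progressively smaller slices, each scalar being broadcast to the shape of the slice on the left-hand side.

```python
import numpy as np

# Q1: a 9x9 array of ones stored as int8
x = np.ones((9, 9), dtype=np.int8)

# Q2: paint concentric rings; each scalar on the right-hand side is
# broadcast to the shape of the slice on the left-hand side
x[1:8, 1:8] = 0
x[2:7, 2:7] = 1
x[3:6, 3:6] = 0
x[4, 4] = 1

print(x)
```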
##### The exercise is validated if all the questions of the exercise are validated.

##### This question is validated if, without having used a for loop or having filled the array manually, the output is:

```console
[[ 7.  1.  7.]
 [nan  2.  2.]
 [nan  8.  8.]
 [ 9.  3.  9.]
 [ 8.  9.  8.]
 [nan  2.  2.]
 [ 8.  2.  8.]
 [nan  6.  6.]
 [ 9.  2.  9.]
 [ 8.  5.  8.]]
```

There are two steps in this exercise:

- Create the vector that contains the grade of the first exam if available, or the grade of the second one otherwise. This can be done using `np.where`:

```python
np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
```

- Add this vector as the third column of the array. Here are two ways:

```python
np.insert(arr=grades, values=new_vector, axis=1, obj=2)

np.hstack((grades, new_vector[:, None]))
```
# Exercise 7: NaN

The goal of this exercise is to learn to deal with missing data in NumPy and to manipulate NumPy arrays.

Let us consider a 2-dimensional array that contains the grades at the past two exams. Some of the students missed the first exam. As the grade is missing, it has been replaced with a `NaN`.

1. Using `np.where`, create a third column that is equal to the grade of the first exam if it exists and to the grade of the second exam otherwise. Add it as the third column of the array.

**Using a for loop or an if/else statement is not allowed in this exercise.**

```python
import numpy as np

generator = np.random.default_rng(123)
grades = np.round(generator.uniform(low=0.0, high=10.0, size=(10, 2)))
grades[[1, 2, 5, 7], [0, 0, 0, 0]] = np.nan
print(grades)
```
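Continuing from the setup above, a sketch of the whole step: `np.where` picks the first-exam grade when it is present, and `np.hstack` appends the result as a new column (variable names are illustrative).

```python
import numpy as np

generator = np.random.default_rng(123)
grades = np.round(generator.uniform(low=0.0, high=10.0, size=(10, 2)))
grades[[1, 2, 5, 7], [0, 0, 0, 0]] = np.nan

# Grade of the first exam when it exists, grade of the second one otherwise
final_grade = np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])

# Append it as a third column ([:, None] turns the vector into a 10x1 column)
grades = np.hstack((grades, final_grade[:, None]))
print(grades)
```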
1. This question is validated if the text file has successfully been loaded into a NumPy array with
`genfromtxt('winequality-red.csv', delimiter=',')` and the reduced array weighs **76800 bytes**.

2. This question is validated if the output is:

```python
array([[ 7.4   ,  0.7   ,  0.    ,  1.9   ,  0.076 , 11.    , 34.    ,
         0.9978,  3.51  ,  0.56  ,  9.4   ,  5.    ],
       [ 7.4   ,  0.66  ,  0.    ,  1.8   ,  0.075 , 13.    , 40.    ,
         0.9978,  3.51  ,  0.56  ,  9.4   ,  5.    ],
       [ 6.7   ,  0.58  ,  0.08  ,  1.8   ,  0.097 , 15.    , 65.    ,
         0.9959,  3.28  ,  0.54  ,  9.2   ,  5.    ]])
```

This indexing gives the answer: `my_data[[1,6,11],:]`.

3. This question is validated if the answer is False. There are many ways to get the answer: find the maximum or check the values greater than 20.

4. This question is validated if the answer is 10.422983114446529.

5. This question is validated if the answer is:

```console
pH stats
25 percentile: 3.21
50 percentile: 3.31
75 percentile: 3.4
mean: 3.3111131957473416
min: 2.74
max: 4.01
```

> *Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results, please use `percentile`.*

6. This question is validated if the answer is ~`5.2`. The first step is to get the 20th percentile of the column `sulphates`, then create a boolean array that contains `True` if the value is smaller than the 20th percentile, then select these rows with the column `quality` and compute the `mean`.

7. This question is validated if the output for the best wines is:

```python
array([ 8.56666667,  0.42333333,  0.39111111,  2.57777778,  0.06844444,
       13.27777778, 33.44444444,  0.99521222,  3.26722222,  0.76777778,
       12.09444444,  8.        ])
```

And the output for the bad wines is:

```python
array([ 8.36    ,  0.8845  ,  0.171   ,  2.635   ,  0.1225  , 11.      ,
       24.9     ,  0.997464,  3.398   ,  0.57    ,  9.955   ,  3.      ])
```

This can be done in three steps: get the max, create a boolean mask that indicates the rows with max quality, use this mask to subset the rows with the best quality, and compute the mean on axis 0.
Citation Request:
This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

1. Title: Wine Quality

2. Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

3. Past Usage:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. pH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
these datasets under a regression approach. The support vector machine model achieved the
best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
analysis procedure).

4. Relevant Information:

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are much more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.

5. Number of Instances: red wine - 1599; white wine - 4898.

6. Number of Attributes: 11 + output attribute

Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
feature selection.

7. Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

8. Missing Attribute Values: None
# Exercise 8: Wine

The goal of this exercise is to learn to perform a basic data analysis on real data using NumPy.

The data set that will be used for this exercise is the red wine data set.

https://archive.ics.uci.edu/ml/datasets/wine+quality

How can you tell if a given 2D array has null columns?

1. Using `genfromtxt`, load the data and reduce the size of the NumPy array by optimizing the types. The sum of absolute differences between the original data set and the "memory" optimized one has to be smaller than `1e-3`. I suggest using `np.float32`. Check that the NumPy array weighs **76800 bytes**.

2. Print the 2nd, 7th and 12th rows as a two-dimensional array

3. Is there any wine with a percentage of alcohol greater than 20%? Return True or False

4. What is the average percentage of alcohol for all wines in the data set? If needed, drop the `np.nan` values

5. Compute the minimum, the maximum, the 25th percentile, the 50th percentile, the 75th percentile and the mean of the pH

6. Compute the average quality of the wines having the 20% least sulphates

7. Compute the mean of all variables for the wines having the best quality. Same question for the wines having the worst quality
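A sketch of the kind of NumPy calls involved, shown on a tiny synthetic "alcohol" column since the real CSV is not reproduced here (the file name and delimiter in the comment come from the exercise; everything else is illustrative):

```python
import numpy as np

# With the real file you would start with something like:
# data = np.genfromtxt('winequality-red.csv', delimiter=',').astype(np.float32)
# Here, a tiny stand-in column of alcohol percentages with a missing entry:
alcohol = np.array([9.4, 9.8, np.nan, 10.5], dtype=np.float32)

# Q3-style check: any value above 20?
any_above_20 = bool(np.nanmax(alcohol) > 20)

# Q4-style mean that ignores NaN values
mean_alcohol = np.nanmean(alcohol)

# Q5-style summary statistics
stats = {
    "min": np.nanmin(alcohol),
    "25": np.nanpercentile(alcohol, 25),
    "50": np.nanpercentile(alcohol, 50),
    "75": np.nanpercentile(alcohol, 75),
    "max": np.nanmax(alcohol),
}
```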
##### This exercise is validated if the output is:

```console
[[0 3 1 2 4]
 [7 6 8 9 5]]
```
nan -9.480000000000000426e+00 1.415000000000000036e+01 1.126999999999999957e+01 -5.650000000000000355e+00 3.330000000000000071e+00 1.094999999999999929e+01 -2.149999999999999911e+00 5.339999999999999858e+00 -2.830000000000000071e+00
9.480000000000000426e+00 nan 4.860000000000000320e+00 -8.609999999999999432e+00 7.820000000000000284e+00 -1.128999999999999915e+01 1.324000000000000021e+01 4.919999999999999929e+00 2.859999999999999876e+00 9.039999999999999147e+00
-1.415000000000000036e+01 -1.126999999999999957e+01 nan 1.227999999999999936e+01 -2.410000000000000142e+00 6.040000000000000036e+00 -5.160000000000000142e+00 -3.870000000000000107e+00 -1.281000000000000050e+01 1.790000000000000036e+00
5.650000000000000355e+00 -3.330000000000000071e+00 -1.094999999999999929e+01 nan -1.364000000000000057e+01 0.000000000000000000e+00 2.240000000000000213e+00 -3.609999999999999876e+00 -7.730000000000000426e+00 8.000000000000000167e-02
2.149999999999999911e+00 -5.339999999999999858e+00 2.830000000000000071e+00 -4.860000000000000320e+00 nan -8.800000000000000044e-01 -8.570000000000000284e+00 2.560000000000000053e+00 -7.030000000000000249e+00 -6.330000000000000071e+00
8.609999999999999432e+00 -7.820000000000000284e+00 1.128999999999999915e+01 -1.324000000000000021e+01 -4.919999999999999929e+00 nan -1.296000000000000085e+01 -1.282000000000000028e+01 -1.403999999999999915e+01 1.456000000000000050e+01
-2.859999999999999876e+00 -9.039999999999999147e+00 -1.227999999999999936e+01 2.410000000000000142e+00 -6.040000000000000036e+00 5.160000000000000142e+00 nan -1.091000000000000014e+01 -1.443999999999999950e+01 -1.372000000000000064e+01
3.870000000000000107e+00 1.281000000000000050e+01 -1.790000000000000036e+00 1.364000000000000057e+01 -0.000000000000000000e+00 -2.240000000000000213e+00 3.609999999999999876e+00 nan 1.053999999999999915e+01 -1.417999999999999972e+01
7.730000000000000426e+00 -8.000000000000000167e-02 8.800000000000000044e-01 8.570000000000000284e+00 -2.560000000000000053e+00 7.030000000000000249e+00 6.330000000000000071e+00 1.296000000000000085e+01 nan -1.169999999999999929e+01
1.282000000000000028e+01 1.403999999999999915e+01 -1.456000000000000050e+01 1.091000000000000014e+01 1.443999999999999950e+01 1.372000000000000064e+01 -1.053999999999999915e+01 1.417999999999999972e+01 1.169999999999999929e+01 nan
# Exercise 9 Football tournament

The goal of this exercise is to learn to use permutations.

A football tournament is organized in your city. There are 10 teams and the director of the tournament wants you to create a first round as exciting as possible. To do so, you are allowed to choose the pairs. As a former data scientist, you implemented a model based on the teams' current season performance. This model predicts the score difference between two teams. You used this algorithm to predict the score difference for every possible pair.
The matrix returned is a 2-dimensional array that contains in (i,j) the score difference between team i and team j. The matrix is in `model_forecasts.txt`.

Using this output, what are the pairs that will give the most interesting matches?

If a team wins 7-1, the match is obviously less exciting than a match where the winner wins 2-1.
The criterion that corresponds to **the pairs that will give the most interesting matches** is **the pairing that minimizes the sum of squared differences**.

The expected output is:

```console
[[m1_t1 m2_t1 m3_t1 m4_t1 m5_t1]
 [m1_t2 m2_t2 m3_t2 m4_t2 m5_t2]]
```

- m1_t1 stands for match1_team1
- m1_t1 plays against m1_t2 ...

**Using a for loop is not allowed; you may need to use the library** `itertools` **to create permutations.**

https://docs.python.org/3.9/library/itertools.html
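The pairing search can be sketched on a small 4-team example (the score-difference matrix below is made up for illustration; the real 10-team matrix comes from `model_forecasts.txt`, and enumerating all permutations of 10 teams with the same idea is noticeably slower). A generator expression over `itertools.permutations` keeps the search declarative, in the spirit of the no-explicit-for-loop constraint:

```python
from itertools import permutations

import numpy as np

# Made-up antisymmetric score-difference matrix for 4 teams:
# d[i, j] is the predicted score difference between team i and team j.
d = np.array([[ 0.,  1.,  5.,  7.],
              [-1.,  0.,  3.,  8.],
              [-5., -3.,  0.,  2.],
              [-7., -8., -2.,  0.]])

def pairing_cost(order):
    """Read an ordering as matches (order[0] vs order[1], ...) and
    return the sum of squared predicted score differences plus the pairs."""
    pairs = list(zip(order[::2], order[1::2]))
    return sum(d[i, j] ** 2 for i, j in pairs), pairs

# Enumerate every ordering of the teams and keep the cheapest pairing.
best_cost, best_pairs = min(pairing_cost(p) for p in permutations(range(len(d))))
print(best_cost, best_pairs)
```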
##### The exercise is validated if all the questions of the exercise are validated.

##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`.

##### Run `python --version`.

###### Does it print `Python 3.x`? x >= 8

##### Do `import jupyter`, `import numpy` and `import pandas` run without any error?
@ -0,0 +1,64 @@
|
||||
# W1D02 Piscine AI - Data Science |
||||
|
||||
## Pandas |
||||
|
||||
The goal of this day is to understand practical usage of **Pandas**. |
||||
As **Pandas** in intensively used in Data Science, other days of the piscine will be dedicated to it. |
||||
|
||||
Not only is the **Pandas** library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection. |
||||
|
||||
**Pandas** is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in **Pandas**. Data in **Pandas** is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn. |
||||
|
||||
Most of the topics we will cover today are explained and describes with examples in the first resource. The number of exercises is low on purpose: Take the time to understand the chapter 5 of the resource, even if there are 40 pages. |
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercice 0 Environment and libraries |
||||
- Exercise 1 Your first DataFrame |
||||
- Exercise 2 Electric power consumption |
||||
- Exercise 3 E-commerce purchases |
||||
- Exercise 4 Handling missing values |
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Jupyter or JupyterLab |
||||
|
||||
*Version of Pandas I used to do the exercises: 1.0.1*. |
||||
I suggest to use the most recent one. |
||||
|
||||
## Resources |
||||
|
||||
- If I had to give you one resource it would be this one: |
||||
|
||||
https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf |
||||
|
||||
It contains ALL you need to know about Pandas. |
||||
|
||||
- Pandas documentation: |
||||
|
||||
- https://pandas.pydata.org/docs/ |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/ |
||||
|
||||
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf |
||||
|
||||
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/ |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html |
||||
|
||||
# Exercise 0 Environment and libraries |
||||
|
||||
The goal of this exercise is to set up the Python work environment with the required libraries. |
||||
|
||||
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries. |
||||
|
||||
I recommend using: |
||||
|
||||
- the **latest stable version** of Python. |
||||
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. |
||||
- one of the most recent versions of the required libraries |
||||
|
||||
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`. |
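One possible way to create it with `conda` (a sketch; `virtualenv`/`venv` plus `pip install` works just as well):

```shell
# Create and activate the environment, then install the required libraries
conda create -n ex00 python=3.8 -y
conda activate ex00
conda install -y pandas numpy jupyter
```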
||||
|
@ -0,0 +1,17 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### The solution of question 1 is accepted if the DataFrame created is the same as the "model" DataFrame. Check that the index is not 1,2,3,4,5. |
||||
|
||||
##### The solution of question 2 is accepted if the types you get for the columns are as below and if the types of the first value of the columns are as below |
||||
|
||||
```console |
||||
<class 'pandas.core.series.Series'> |
||||
<class 'pandas.core.series.Series'> |
||||
<class 'pandas.core.series.Series'> |
||||
``` |
||||
|
||||
```console |
||||
<class 'str'> |
||||
<class 'list'> |
||||
<class 'float'> |
||||
``` |
@ -0,0 +1,17 @@
|
||||
# Exercise 1: Your first DataFrame |
||||
|
||||
The goal of this exercise is to learn to create basic Pandas objects. |
||||
|
||||
1. Create the DataFrame below in two ways: |
||||
- From a NumPy array |
||||
- From a Pandas Series |
||||
|
||||
| | color | list | number | |
||||
|---:|:--------|:--------|---------:| |
||||
| 1 | Blue | [1, 2] | 1.1 | |
||||
| 3 | Red | [3, 4] | 2.2 | |
||||
| 5 | Pink | [5, 6] | 3.3 | |
||||
| 7 | Grey | [7, 8] | 4.4 | |
||||
| 9 | Black | [9, 10] | 5.5 | |
||||
|
||||
2. Print the type of every column and the type of the first value of every column |
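One possible sketch of the two constructions (`dtype=object` is needed so the lists survive the NumPy route; the exact code is up to you):

```python
import numpy as np
import pandas as pd

idx = [1, 3, 5, 7, 9]

# From a NumPy array: dtype=object keeps the inner lists intact
data = np.array([['Blue', [1, 2], 1.1],
                 ['Red', [3, 4], 2.2],
                 ['Pink', [5, 6], 3.3],
                 ['Grey', [7, 8], 4.4],
                 ['Black', [9, 10], 5.5]], dtype=object)
df1 = pd.DataFrame(data, index=idx, columns=['color', 'list', 'number'])

# From Pandas Series: one Series per column, all sharing the same index
df2 = pd.DataFrame({
    'color': pd.Series(['Blue', 'Red', 'Pink', 'Grey', 'Black'], index=idx),
    'list': pd.Series([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], index=idx),
    'number': pd.Series([1.1, 2.2, 3.3, 4.4, 5.5], index=idx),
})
```

For question 2, iterate over `df1.columns` and print `type(df1[col])` and `type(df1[col].iloc[0])`.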
@ -0,0 +1,101 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### The solution of question 1 is accepted if you use `drop` with `axis=1`. `inplace=True` may be useful to avoid assigning the result to a variable. A solution that could be accepted too (even if it's not a solution I recommend) is `del`. |
||||
|
||||
##### The solution of question 2 is accepted if the DataFrame returns the output below. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted. I recommend using `set_index` with `inplace=True` to do so. |
||||
|
||||
```python |
||||
Input: df.head().index |
||||
|
||||
Output: |
||||
|
||||
DatetimeIndex(['2006-12-16', '2006-12-16','2006-12-16', '2006-12-16','2006-12-16'], |
||||
dtype='datetime64[ns]', name='Date', freq=None) |
||||
``` |
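A minimal sketch of that step on synthetic data (the raw `Date` strings are assumed to be in `DD/MM/YYYY` format, as in the original file):

```python
import pandas as pd

# Toy frame standing in for the real data set
df = pd.DataFrame({'Date': ['16/12/2006', '16/12/2006', '17/12/2006'],
                   'Voltage': [234.84, 233.63, 233.29]})
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')  # parse the strings
df.set_index('Date', inplace=True)                          # Date becomes the index
```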
||||
|
||||
##### The solution of question 3 is accepted if all the types are `float64` as below. The preferred solution is `pd.to_numeric` with `errors='coerce'`. |
||||
|
||||
```python |
||||
Input: df.dtypes |
||||
|
||||
Output: |
||||
|
||||
Global_active_power float64 |
||||
Global_reactive_power float64 |
||||
Voltage float64 |
||||
Global_intensity float64 |
||||
Sub_metering_1 float64 |
||||
dtype: object |
||||
|
||||
``` |
||||
|
||||
##### The solution of question 4 is accepted if you use `df.describe()`. |
||||
|
||||
##### The solution of question 5 is accepted if you used `dropna` and the number of missing values is equal to 0. You should have noticed that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` lets you check the number of missing values and `df.dropna()` with `inplace=True` removes the rows with missing values. |
||||
|
||||
##### The solution of question 6 is accepted if one of the two approaches below were used: |
||||
|
||||
```python |
||||
#solution 1 |
||||
df.loc[:,'A'] = (df['A'] + 1) * 0.06 |
||||
|
||||
#solution 2 |
||||
df.loc[:,'A'] = df.loc[:,'A'].apply(lambda x: (x+1)*0.06) |
||||
|
||||
``` |
||||
|
||||
|
||||
You may wonder whether `df.loc[:,'A']` is required and if `df['A'] = ...` works too. **The answer is no**. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you would be assigning a value to a **copy** of the DataFrame and not to the DataFrame itself. |
||||
More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas |
||||
|
||||
##### The solution of question 7 is accepted as long as the output of `print(filtered_df.head().to_markdown())` is as below and if the number of rows is equal to **449667**. |
||||
|
||||
| Date | Global_active_power | Global_reactive_power | |
||||
|:--------------------|----------------------:|------------------------:| |
||||
| 2008-12-27 00:00:00 | 0.996 | 0.066 | |
||||
| 2008-12-27 00:00:00 | 1.076 | 0.162 | |
||||
| 2008-12-27 00:00:00 | 1.064 | 0.172 | |
||||
| 2008-12-27 00:00:00 | 1.07 | 0.174 | |
||||
| 2008-12-27 00:00:00 | 0.804 | 0.184 | |
||||
|
||||
##### The solution of question 8 is accepted if the output is |
||||
|
||||
```console |
||||
Global_active_power 0.254 |
||||
Global_reactive_power 0.000 |
||||
Voltage 238.350 |
||||
Global_intensity 1.200 |
||||
Sub_metering_1 0.000 |
||||
Name: 2007-02-16 00:00:00, dtype: float64 |
||||
|
||||
``` |
||||
|
||||
##### The solution of question 9 is accepted if the output is `Timestamp('2009-02-22 00:00:00')` |
||||
|
||||
##### The solution of question 10 is accepted if the output of `print(sorted_df.tail().to_markdown())` is |
||||
|
||||
| Date | Global_active_power | Global_reactive_power | Voltage | |
||||
|:--------------------|----------------------:|------------------------:|----------:| |
||||
| 2008-08-28 00:00:00 | 0.076 | 0 | 234.88 | |
||||
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.18 | |
||||
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.4 | |
||||
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.64 | |
||||
| 2008-12-08 00:00:00 | 0.076 | 0 | 236.5 | |
||||
|
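The two-key sort of question 10 can be sketched on a toy frame (the column subset and values here are illustrative, not the real data):

```python
import pandas as pd

df = pd.DataFrame({'Global_active_power': [0.076, 1.5, 0.076],
                   'Global_reactive_power': [0.0, 0.1, 0.0],
                   'Voltage': [235.4, 240.0, 234.88]})

# Descending order on the first key, ascending on the second
sorted_df = df.sort_values(by=['Global_active_power', 'Voltage'],
                           ascending=[False, True])
```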
||||
##### The solution of question 11 is accepted if the output is as below. The solution is based on `groupby` which creates groups based on the index `Date` and aggregates the groups using the `mean`. |
||||
|
||||
```console |
||||
Date |
||||
2006-12-16 3.053475 |
||||
2006-12-17 2.354486 |
||||
2006-12-18 1.530435 |
||||
2006-12-19 1.157079 |
||||
2006-12-20 1.545658 |
||||
... |
||||
2010-12-07 0.770538 |
||||
2010-12-08 0.367846 |
||||
2010-12-09 1.119508 |
||||
2010-12-10 1.097008 |
||||
2010-12-11 1.275571 |
||||
Name: Global_active_power, Length: 1433, dtype: float64 |
||||
``` |
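The aggregation described above can be sketched as follows (a toy frame with `Date` as index; the real one has many rows per day):

```python
import pandas as pd

df = pd.DataFrame(
    {'Global_active_power': [3.0, 3.1, 1.5, 1.6]},
    index=pd.to_datetime(['2006-12-16', '2006-12-16', '2006-12-17', '2006-12-17']),
)
df.index.name = 'Date'

# Group the rows sharing the same Date and aggregate each group with the mean
daily_avg = df.groupby(df.index)['Global_active_power'].mean()
```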
@ -0,0 +1 @@
|
||||
Empty file. The original is too big to be pushed to GitHub. |
@ -0,0 +1,24 @@
|
||||
# Exercise 2 **Electric power consumption** |
||||
|
||||
The goal of this exercise is to learn to manipulate real data with Pandas. |
||||
|
||||
The data set used is **Individual household electric power consumption** |
||||
|
||||
1. Delete the columns `Time`, `Sub_metering_2` and `Sub_metering_3` |
||||
2. Set `Date` as index |
||||
3. Create a function that takes as input the DataFrame with the data set and returns a DataFrame with updated types: |
||||
|
||||
```python |
||||
def update_types(df): |
||||
#TODO |
||||
return df |
||||
``` |
||||
|
||||
4. Use `describe` to get an overview of the data set |
||||
5. Delete the rows with missing values |
||||
6. Modify `Sub_metering_1` by adding 1 to it and multiplying the total by 0.06. If x is a value of the column, the output is: (x+1)*0.06 |
||||
7. Select all the rows for which the Date is greater than or equal to 2008-12-27 and `Voltage` is greater than or equal to 242 |
||||
8. Print the 88888th row. |
||||
9. What is the date for which the `Global_active_power` is maximal ? |
||||
10. Sort the first three columns by descending order of `Global_active_power` and ascending order of `Voltage`. |
||||
11. Compute the daily average of `Global_active_power`. |
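A minimal sketch of `update_types` (question 3) on synthetic data — in the real file the values are strings, with `?` marking missing entries (an assumption based on the data set description):

```python
import pandas as pd

# Toy frame: the real data set has the same string-typed columns
df = pd.DataFrame({
    'Global_active_power': ['4.216', '5.360', '?'],
    'Voltage': ['234.84', '233.63', '233.29'],
})

def update_types(df):
    # errors='coerce' turns non-numeric strings like '?' into NaN
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df

df = update_types(df)
df.dropna(inplace=True)  # question 5: remove the rows with missing values
```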
@ -0,0 +1,49 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### To validate this exercise all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas. |
||||
|
||||
##### The solution of question 1 is accepted if it contains **10000 entries** and **14 columns**. There are many solutions, based on `shape`, `info`, or `describe`. |
||||
|
||||
##### The solution of question 2 is accepted if the answer is **50.34730200000025**. |
||||
|
||||
Even if `np.mean` gives the solution, `df['Purchase Price'].mean()` is preferred |
||||
|
||||
##### The solution of question 3 is accepted if the min is `0` and the max is `99.989999999999995` |
||||
|
||||
|
||||
##### The solution of question 4 is accepted if the answer is **1098** |
||||
|
||||
##### The solution of question 5 is accepted if the answer is **30** |
||||
|
||||
##### The solution of question 6 is accepted if there are `4932` people that made the purchase during the `AM` and `5068` people that made the purchase during `PM`. There are many ways to reach the solution but the goal of this question was to make you use `value_counts` |
||||
|
||||
##### The solution of question 7 is accepted if the answer is as below. There are many ways to reach the solution but the goal of this question was to make you use `value_counts` |
||||
|
||||
```console |
||||
Interior and spatial designer    31 |
||||
Lawyer                           30 |
||||
Social researcher                28 |
||||
Purchasing manager               27 |
||||
Designer, jewellery              27 |
||||
``` |
||||
|
||||
|
||||
##### The solution of question 8 is accepted if the purchase price is **75.1** |
||||
|
||||
|
||||
##### The solution of question 9 is accepted if the email address is **bondellen@williams-garza.com** |
||||
|
||||
##### The solution of question 10 is accepted if the answer is **39**. The preferred solution is based on this: `df[(df['A'] == X) & (df['B'] > Y)]` |
||||
|
||||
|
||||
##### The solution of question 11 is accepted if the answer is **1033**. The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the expiration date. |
||||
|
||||
##### The solution of question 12 is accepted if the answer is as below. The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences. |
||||
|
||||
- hotmail.com 1638 |
||||
- yahoo.com 1616 |
||||
- gmail.com 1605 |
||||
- smith.com 42 |
||||
- williams.com 37 |
@ -0,0 +1,20 @@
|
||||
# Exercise 3: E-commerce purchases |
||||
|
||||
The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided since exercise 2 should have given you a good introduction. |
||||
|
||||
The data set used is **E-commerce purchases**. |
||||
|
||||
Questions: |
||||
|
||||
1. How many rows and columns are there? |
||||
2. What is the average Purchase Price? |
||||
3. What were the highest and lowest purchase prices? |
||||
4. How many people have English `'en'` as their Language of choice on the website? |
||||
5. How many people have the job title of `"Lawyer"` ? |
||||
6. How many people made the purchase during the `AM` and how many people made the purchase during `PM` ? |
||||
7. What are the 5 most common Job Titles? |
||||
8. Someone made a purchase that came from Lot: `"90 WT"` , what was the Purchase Price for this transaction? |
||||
9. What is the email of the person with the following Credit Card Number: `4926535242672853` |
||||
10. How many people have American Express as their Credit Card Provider and made a purchase above `$95` ? |
||||
11. How many people have a credit card that expires in `2025`? |
||||
12. What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...) |
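A sketch of the typical idioms on a tiny synthetic frame (the column names match the real data set, but the values here are made up, so the numbers below are not the expected answers):

```python
import pandas as pd

df = pd.DataFrame({
    'Purchase Price': [98.14, 70.73, 0.0, 95.5],
    'Language': ['en', 'fr', 'en', 'de'],
    'AM or PM': ['PM', 'AM', 'PM', 'PM'],
    'CC Provider': ['American Express', 'VISA', 'American Express', 'Mastercard'],
    'CC Exp Date': ['02/20', '11/18', '05/25', '10/25'],
    'Email': ['a@gmail.com', 'b@yahoo.com', 'c@gmail.com', 'd@hotmail.com'],
})

rows, cols = df.shape                                    # question 1
mean_price = df['Purchase Price'].mean()                 # question 2
n_en = (df['Language'] == 'en').sum()                    # question 4
am_pm = df['AM or PM'].value_counts()                    # question 6
amex_over_95 = df[(df['CC Provider'] == 'American Express')
                  & (df['Purchase Price'] > 95)]         # question 10
exp_2025 = df['CC Exp Date'].apply(lambda s: s.split('/')[1] == '25').sum()  # q. 11
providers = df['Email'].apply(lambda s: s.split('@')[1]).value_counts()      # q. 12
```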
@ -0,0 +1,32 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated (except the bonus question) |
||||
|
||||
##### The solution of question 1 is accepted if you have done these two steps in that order. First, convert the numerical columns to `float` and then fill the missing values. The first step may involve `pd.to_numeric(df.loc[:,col], errors='coerce')`. The second step is validated if you eliminated all missing values. However there are many possibilities to fill the missing values. Here is one of them: |
||||
|
||||
example: |
||||
|
||||
```python |
||||
df.fillna({'sepal_length': df.sepal_length.mean(), |
||||
           'sepal_width': df.sepal_width.median(), |
||||
           'petal_length': 0, |
||||
           'petal_width': 0}) |
||||
``` |
||||
|
||||
##### The solution of question 2 is accepted if the solution is `df.loc[:,col].fillna(df[col].median())`. |
||||
|
||||
##### The solution of the bonus question is accepted if you find out this answer: once we filled the missing values as suggested in the first question, `df.describe()` returns this interesting summary. We notice that the mean is way higher than the median, which means that there may be some outliers in the data. The 75% quantile and the max confirm that: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet you realise this small flower can't be that big. The outliers have a major impact on the mean, which equals 56.9. Filling the missing values with this value is not correct since it doesn't correspond to the real size of this flower. That is why, in this case, the best strategy to fill the missing values is the median. The truth is that I modified the data set! But real data sets ALWAYS contain outliers. Always think about the meaning of the data transformation! If you fill the missing values with zero, it means that you consider that the length or width of some flowers may be 0. It doesn't make sense. |
||||
|
||||
|
||||
| | sepal_length | sepal_width | petal_length | petal_width | |
||||
|:------|---------------:|--------------:|---------------:|--------------:| |
||||
| count | 146 | 141 | 120 | 147 | |
||||
| mean | 56.9075 | 52.6255 | 15.5292 | 12.0265 | |
||||
| std | 572.222 | 417.127 | 127.46 | 131.873 | |
||||
| min | -4.4 | -3.6 | -4.8 | -2.5 | |
||||
| 25% | 5.1 | 2.8 | 2.725 | 0.3 | |
||||
| 50% | 5.75 | 3 | 4.5 | 1.3 | |
||||
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | |
||||
| max | 6900 | 3809 | 1400 | 1600 | |
||||
|
||||
|
||||
|
||||
##### The solution of the bonus question is accepted if you noticed that there are some negative values and some huge values. If you did, you will be a good data scientist. **YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA**. Print the row with index 122 ;-) This week, we will have the opportunity to focus on data pre-processing to understand how the outliers can be handled. |
|
@ -0,0 +1,152 @@
|
||||
sepal_length,sepal_width,petal_length,petal_width, flower |
||||
5.1,3.5,1.4,0.2,Iris-setosa |
||||
4.9,3.0,1.4,0.2,Iris-setosa |
||||
4.7,3.2,1.3,0.2,Iris-setosa |
||||
4.6,3.1,1.5,0.2,Iris-setosa |
||||
5.0,-3.6,-1.4,0.2,Iris-setosa |
||||
5.4,3.9,1.7,0.4,Iris-setosa |
||||
4.6,3.4,1.4,0.3,Iris-setosa |
||||
5.0,3.4,1.5,0.2,Iris-setosa |
||||
-4.4,2.9,1400,0.2,Iris-setosa |
||||
4.9,3.1,1.5,0.1,Iris-setosa |
||||
5.4,3.7,1.5,0.2,Iris-setosa |
||||
4.8,3.4,1.6,0.2,Iris-setosa |
||||
4.8,3.0,1.4,0.1,Iris-setosa |
||||
4.3,3.0,1.1,0.1,Iris-setosa |
||||
5.8,4.0,1.2,0.2,Iris-setosa |
||||
5.7,4.4,1500,0.4,Iris-setosa |
||||
5.4,3.9,1.3,0.4,Iris-setosa |
||||
5.1,3.5,1.4,0.3,Iris-setosa |
||||
5.7,3.8,1.7,0.3,Iris-setosa |
||||
5.1,3.8,1.5,0.3,Iris-setosa |
||||
5.4,3.4,-1.7,0.2,Iris-setosa |
||||
5.1,3.7,1.5,0.4,Iris-setosa |
||||
4.6,3.6,1.0,0.2,Iris-setosa |
||||
5.1,3.3,1.7,0.5,Iris-setosa |
||||
4.8,3.4,1.9,0.2,Iris-setosa |
||||
5.0,-3.0,1.6,0.2,Iris-setosa |
||||
5.0,3.4,1.6,0.4,Iris-setosa |
||||
5.2,3.5,1.5,0.2,Iris-setosa |
||||
5.2,3.4,1.4,0.2,Iris-setosa |
||||
4.7,3.2,1.6,0.2,Iris-setosa |
||||
4.8,3.1,1.6,0.2,Iris-setosa |
||||
5.4,3.4,1.5,0.4,Iris-setosa |
||||
5.2,4.1,1.5,0.1,Iris-setosa |
||||
5.5,4.2,1.4,0.2,Iris-setosa |
||||
4.9,3.1,1.5,0.1,Iris-setosa |
||||
5.0,3.2,1.2,0.2,Iris-setosa |
||||
5.5,3.5,1.3,0.2,Iris-setosa |
||||
4.9,,1.5,0.1,Iris-setosa |
||||
4.4,3.0,1.3,0.2,Iris-setosa |
||||
5.1,3.4,1.5,0.2,Iris-setosa |
||||
5.0,"3.5",1.3,0.3,Iris-setosa |
||||
4.5,2.3,1.3,0.3,Iris-setosa |
||||
4.4,3.2,1.3,0.2,Iris-setosa |
||||
5.0,3.5,1.6,0.6,Iris-setosa |
||||
5.1,3.8,1.9,0.4,Iris-setosa |
||||
4.8,3.0,1.4,0.3,Iris-setosa |
||||
5.1,3809,1.6,0.2,Iris-setosa |
||||
4.6,3.2,1.4,0.2,Iris-setosa |
||||
5.3,3.7,1.5,0.2,Iris-setosa |
||||
5.0,3.3,1.4,0.2,Iris-setosa |
||||
7.0,3.2,4.7,1.4,Iris-versicolor |
||||
6.4,3200,4.5,1.5,Iris-versicolor |
||||
6.9,3.1,4.9,1.5,Iris-versicolor |
||||
5.5,2.3,4.0,1.3,Iris-versicolor |
||||
6.5,2.8,4.6,1.5,Iris-versicolor |
||||
5.7,2.8,4.5,1.3,Iris-versicolor |
||||
6.3,3.3,4.7,1600,Iris-versicolor |
||||
4.9,2.4,3.3,1.0,Iris-versicolor |
||||
6.6,2.9,4.6,1.3,Iris-versicolor |
||||
5.2,2.7,3.9,,Iris-versicolor |
||||
5.0,2.0,3.5,1.0,Iris-versicolor |
||||
5.9,3.0,4.2,1.5,Iris-versicolor |
||||
6.0,2.2,4.0,1.0,Iris-versicolor |
||||
6.1,2.9,4.7,1.4,Iris-versicolor |
||||
5.6,2.9,3.6,1.3,Iris-versicolor |
||||
6.7,3.1,4.4,1.4,Iris-versicolor |
||||
5.6,3.0,4.5,1.5,Iris-versicolor |
||||
5.8,2.7,4.1,1.0,Iris-versicolor |
||||
6.2,2.2,4.5,1.5,Iris-versicolor |
||||
5.6,2.5,3.9,1.1,Iris-versicolor |
||||
5.9,3.2,4.8,1.8,Iris-versicolor |
||||
6.1,2.8,4.0,1.3,Iris-versicolor |
||||
6.3,2.5,4.9,1.5,Iris-versicolor |
||||
6.1,2.8,4.7,1.2,Iris-versicolor |
||||
6.4,2.9,4.3,1.3,Iris-versicolor |
||||
6.6,3.0,4.4,1.4,Iris-versicolor |
||||
6.8,2.8,4.8,1.4,Iris-versicolor |
||||
6.7,3.0,5.0,1.7,Iris-versicolor |
||||
6.0,2.9,4.5,1.5,Iris-versicolor |
||||
5.7,2.6,3.5,1.0,Iris-versicolor |
||||
5.5,2.4,3.8,1.1,Iris-versicolor |
||||
5.5,2.4,3.7,1.0,Iris-versicolor |
||||
5.8,2.7,3.9,1.2,Iris-versicolor |
||||
6.0,2.7,5.1,1.6,Iris-versicolor |
||||
5.4,3.0,4.5,1.5,Iris-versicolor |
||||
6.0,3.4,4.5,1.6,Iris-versicolor |
||||
6.7,3.1,4.7,1.5,Iris-versicolor |
||||
6.3,2.3,4.4,1.3,Iris-versicolor |
||||
5.6,3.0,4.1,1.3,Iris-versicolor |
||||
5.5,2.5,4.0,1.3,Iris-versicolor |
||||
5.5,2.6,4.4,1.2,Iris-versicolor |
||||
6.1,3.0,4.6,1.4,Iris-versicolor |
||||
5.8,2.6,4.0,1.2,Iris-versicolor |
||||
5.0,2.3,3.3,1.0,Iris-versicolor |
||||
5.6,2.7,4.2,1.3,Iris-versicolor |
||||
5.7,3.0,4.2,1.2,Iris-versicolor |
||||
5.7,2.9,4.2,1.3,Iris-versicolor |
||||
6.2,2.9,4.3,1.3,Iris-versicolor |
||||
5.1,2.5,3.0,1.1,Iris-versicolor |
||||
5.7,2.8,4.1,1.3,Iris-versicolor |
||||
6.3,3.3,6.0,2.5,Iris-virginica |
||||
5.8,2.7,5.1,1.9,Iris-virginica |
||||
7.1,3.0,5.9,2.1,Iris-virginica |
||||
6.3,2.9,5.6,1.8,Iris-virginica |
||||
6.5,3.0,5.8,2.2,Iris-virginica |
||||
7.6,3.0,6.6,2.1,Iris-virginica |
||||
4.9,2.5,4.5,1.7,Iris-virginica |
||||
7.3,2.9,6.3,1.8,Iris-virginica |
||||
6.7,2.5,5.8,1.8,Iris-virginica |
||||
7.2,3.6,6.1,2.5,Iris-virginica |
||||
6.5,3.2,5.1,2.0,Iris-virginica |
||||
6.4,2.7,5.3,1.9,Iris-virginica |
||||
6.8,3.0,5.5,2.1,Iris-virginica |
||||
5.7,2.5,5.0,2.0,Iris-virginica |
||||
5.8,2.8,5.1,2.4,Iris-virginica |
||||
6.4,3.2,5.3,2.3,Iris-virginica |
||||
6.5,3.0,5.5,1.8,Iris-virginica |
||||
7.7,3.8,6.7,2.2,Iris-virginica |
||||
7.7,2.6,6.9,2.3,Iris-virginica |
||||
6.0,2.2,5.0,1.5,Iris-virginica |
||||
6.9,3.2,5.7,2.3,Iris-virginica |
||||
5.6,2.8,4.9,2.0,Iris-virginica |
||||
7.7,2.8,6.7,2.0,Iris-virginica |
||||
6.3,2.7,4.9,1.8,Iris-virginica |
||||
6.7,3.3,5.7,2.1,Iris-virginica |
||||
7.2,3.2,6.0,1.8,Iris-virginica |
||||
6.2,2.8,-4.8,1.8,Iris-virginica |
||||
6.1,3.0,4.9,1.8,Iris-virginica |
||||
6.4,2.8,5.6,2.1,Iris-virginica |
||||
7.2,3.0,5.8,1.6,Iris-virginica |
||||
7.4,2.8,6.1,1.9,Iris-virginica |
||||
7.9,3.8,6.4,2.0,Iris-virginica |
||||
6.-4,2.8,5.6,2.2,Iris-virginica |
||||
6.3,2.8,"5.1",1.5,Iris-virginica |
||||
6.1,2.6,5.6,1.4,Iris-virginica |
||||
7.7,3.0,6.1,2.3,Iris-virginica |
||||
6.3,3.4,5.6,2.4,Iris-virginica |
||||
6.4,3.1,5.5,1.8,Iris-virginica |
||||
6.0,3.0,4.8,1.8,Iris-virginica |
||||
6900,3.1,5.4,2.1,Iris-virginica |
||||
6.7,3.1,5.6,2.4,Iris-virginica |
||||
6.9,3.1,5.1,2.3,Iris-virginica |
||||
580,2.7,5.1,1.9,Iris-virginica |
||||
6.8,3.2,5.9,2.3,Iris-virginica |
||||
6.7,3.3,5.7,-2.5,Iris-virginica |
||||
6.7,3.0,5.2,2.3,Iris-virginica |
||||
6.3,2.5,5.0,1.9,Iris-virginica |
||||
6.5,3.0,5.2,2.0,Iris-virginica |
||||
6.2,3.4,5.4,2.3,Iris-virginica |
||||
5.9,3.0,5.1,1.8,Iris-virginica |
||||
|
@ -0,0 +1,26 @@
|
||||
# Exercise 4 Handling missing values |
||||
|
||||
The goal of this exercise is to learn to handle missing values. In the previous exercise we used the first technique: filtering out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small. |
||||
|
||||
This article explains the different types of missing data and how they should be handled. |
||||
|
||||
https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b |
||||
|
||||
"**It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.**" |
||||
|
||||
- Preliminary: Drop the `flower` column |
||||
|
||||
1. Fill the missing values with a different "strategy" for each column: |
||||
|
||||
`sepal_length` -> `mean` |
||||
|
||||
`sepal_width` -> `median` |
||||
|
||||
`petal_length`, `petal_width` -> `0` |
||||
|
||||
2. Fill the missing values using the median of the associated column using `fillna`. |
||||
|
||||
|
||||
- Bonus questions: |
||||
- Filling the missing values with 0 or the mean of the associated column is common in Data Science. For this data set, explain why filling the missing values with 0 or the mean is a bad idea. |
||||
- Find a special row ;-) |
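The two filling strategies can be sketched on a tiny frame with the same column names as the iris file (the values here are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'sepal_length': [5.1, np.nan, 4.7],
    'sepal_width': [3.5, 3.0, np.nan],
    'petal_length': [np.nan, 1.4, 1.3],
    'petal_width': [0.2, np.nan, 0.2],
})

# Question 1: one strategy per column, via a dict mapping column -> fill value
filled = df.fillna({'sepal_length': df['sepal_length'].mean(),
                    'sepal_width': df['sepal_width'].median(),
                    'petal_length': 0,
                    'petal_width': 0})

# Question 2: the median of every column at once
filled_median = df.fillna(df.median())
```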
@ -0,0 +1,48 @@
|
||||
# W1D02 Piscine AI - Data Science |
||||
|
||||
## Pandas |
||||
|
||||
The goal of this day is to understand practical usage of **Pandas**. |
||||
As **Pandas** is intensively used in Data Science, other days of the piscine will be dedicated to it. |
||||
|
||||
Not only is the **Pandas** library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection. |
||||
|
||||
**Pandas** is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in **Pandas**. Data in **Pandas** is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn. |
||||
|
||||
Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even if it is 40 pages long. |
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercise 1 Your first DataFrame |
||||
- Exercise 2 Electric power consumption |
||||
- Exercise 3 E-commerce purchases |
||||
- Exercise 4 Handling missing values |
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Jupyter or JupyterLab |
||||
|
||||
*Version of Pandas I used to do the exercises: 1.0.1*. |
||||
I suggest using the most recent one. |
||||
|
||||
## Resources |
||||
|
||||
- If I had to give you one resource it would be this one: |
||||
|
||||
https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf |
||||
|
||||
It contains ALL you need to know about Pandas. |
||||
|
||||
- Pandas documentation: |
||||
|
||||
- https://pandas.pydata.org/docs/ |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/ |
||||
|
||||
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf |
||||
|
||||
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/ |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html |
@ -0,0 +1,9 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### Activate the virtual environment. If you used `conda` run `conda activate your_env` |
||||
|
||||
##### Run `python --version` |
||||
|
||||
###### Does it print `Python 3.x`? x >= 8 |
||||
|
||||
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import plotly` run without any error? |
@ -0,0 +1,62 @@
|
||||
# W1D03 Piscine AI - Data Science |
||||
|
||||
## Visualizations |
||||
|
||||
While working on a dataset it is important to check the distribution of the data. Obviously, for most humans it is difficult to visualize data in more than 3 dimensions. |
||||
|
||||
"Viz" is important to understand the data and to show results. We'll discover three libraries to visualize data in Python. These are one of the most used visualisation "libraries" in Python: |
||||
|
||||
- Pandas visualization module |
||||
- Matplotlib |
||||
- Plotly |
||||
|
||||
The goal is to understand the basics of those libraries. You'll have time during the project to master one (or all three) of them. |
||||
You may wonder why one library is not enough. The reason is simple: it depends on the usage. |
||||
For example, if you want to check the data quickly you may want to use the Pandas viz module or Matplotlib. |
||||
If you want to create a custom and more elaborate plot, I suggest using Matplotlib or Plotly. |
||||
And if you want to create a very nice and interactive plot, I suggest using Plotly. |
||||
|
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercise 1 Pandas plot 1 |
||||
- Exercise 2 Pandas plot 2 |
||||
- Exercise 3 Matplotlib 1 |
||||
- Exercise 4 Matplotlib 2 |
||||
- Exercise 5 Matplotlib subplots |
||||
- Exercise 6 Plotly 1 |
||||
- Exercise 7 Plotly Box plots |
||||
|
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Matplotlib |
||||
- Plotly |
||||
- Jupyter or JupyterLab |
||||
|
||||
I suggest using the most recent version of the packages. |
||||
|
||||
## Resources |
||||
|
||||
- https://matplotlib.org/3.3.3/tutorials/index.html |
||||
- https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596 |
||||
|
||||
- https://github.com/rougier/matplotlib-tutorial |
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html |
||||
|
||||
|
||||
# Exercise 0 Environment and libraries |
||||
|
||||
The goal of this exercise is to set up the Python work environment with the required libraries. |
||||
|
||||
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries. |
||||
|
||||
I recommend using: |
||||
|
||||
- the **latest stable version** of Python. |
||||
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. |
||||
- one of the most recent versions of the required libraries |
||||
|
||||
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `plotly`. |
@ -0,0 +1,8 @@
|
||||
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria. |
||||
|
||||
###### Does it have a title? |
||||
###### Does it have a name on x-axis ? |
||||
###### Does it have a legend ? |
||||
![alt text][logo] |
||||
|
||||
[logo]: ../w1day03_ex1_plot1.png "Bar plot ex1" |
@ -0,0 +1,28 @@
|
||||
# Exercise 1 Pandas plot 1 |
||||
|
||||
The goal of this exercise is to learn to create plots with Pandas. Pandas' `.plot()` is a wrapper around `matplotlib.pyplot.plot()`. |
||||
|
||||
Here is the data we will be using: |
||||
|
||||
```python |
||||
df = pd.DataFrame({ |
||||
'name':['christopher','marion','maria','mia','clement','randy','remi'], |
||||
'age':[70,30,22,19,45,33,20], |
||||
'gender':['M','F','F','F','M','M','M'], |
||||
'state':['california','dc','california','dc','california','new york','porto'], |
||||
'num_children':[2,0,0,3,8,1,4], |
||||
'num_pets':[5,1,0,5,2,2,3] |
||||
}) |
||||
``` |
||||
|
||||
1. Reproduce this plot. This plot is called a bar plot |
||||
|
||||
![alt text][logo] |
||||
|
||||
[logo]: ./w1day03_ex1_plot1.png "Bar plot ex1" |
||||
|
||||
The plot has to contain: |
||||
|
||||
- the title |
||||
- name on x-axis |
||||
- legend |
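A possible sketch of the bar plot (which columns the target image actually shows is an assumption, since only the image defines them):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'name': ['christopher', 'marion', 'maria', 'mia', 'clement', 'randy', 'remi'],
    'num_children': [2, 0, 0, 3, 8, 1, 4],
    'num_pets': [5, 1, 0, 5, 2, 2, 3],
})

# Bar plot with the names on the x-axis; the title text is hypothetical
ax = df.plot(kind='bar', x='name', y=['num_children', 'num_pets'],
             title='Number of children and pets per person')
plt.tight_layout()
```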
@ -0,0 +1,8 @@
|
||||
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria. You should also observe that the older people are, the more children they have. |
||||
|
||||
###### Does it have a title? |
||||
###### Does it have a name on x-axis and y-axis ? |
||||
|
||||
![alt text][logo_ex2] |
||||
|
||||
[logo_ex2]: ../w1day03_ex2_plot1.png "Scatter plot ex2" |
@ -0,0 +1,26 @@
|
||||
## Exercise 2: Pandas plot 2 |
||||
|
||||
The goal of this exercise is to learn to create plots with Pandas. Pandas' `.plot()` is a wrapper around `matplotlib.pyplot.plot()`. |
||||
|
||||
```python |
||||
df = pd.DataFrame({ |
||||
'name':['christopher','marion','maria','mia','clement','randy','remi'], |
||||
'age':[70,30,22,19,45,33,20], |
||||
'gender':['M','F','F','F','M','M','M'], |
||||
'state':['california','dc','california','dc','california','new york','porto'], |
||||
'num_children':[4,2,1,0,3,1,0], |
||||
'num_pets':[5,1,0,2,2,2,3] |
||||
}) |
||||
``` |
||||
|
||||
1. Reproduce this plot. This plot is called a scatter plot. Do you observe a relationship between the age and the number of children ? |
||||
|
||||
![alt text][logo_ex2] |
||||
|
||||
[logo_ex2]: ./w1day03_ex2_plot1.png "Scatter plot ex2" |
||||
|
||||
The plot has to contain: |
||||
|
||||
- the title |
||||
- name on x-axis |
||||
- name on y-axis |
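A possible sketch of the scatter plot (the title text is hypothetical; `kind='scatter'` labels the axes with the column names automatically):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripts without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'name': ['christopher', 'marion', 'maria', 'mia', 'clement', 'randy', 'remi'],
    'age': [70, 30, 22, 19, 45, 33, 20],
    'num_children': [4, 2, 1, 0, 3, 1, 0],
})

# Scatter plot of age against number of children
ax = df.plot(kind='scatter', x='age', y='num_children',
             title='Number of children per age')
```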
@ -0,0 +1,11 @@
|
||||
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria. |
||||
|
||||
###### Does it have a title? |
||||
###### Does it have a name on x-axis and y-axis ? |
||||
###### Are the x-axis and y-axis limited to [1,8] ? |
||||
###### Is the line a red dashdot line with a width of 3 ? |
||||
###### Are the circles blue circles with a size of 12 ? |
||||
|
||||
![alt text][logo_ex3] |
||||
|
||||
[logo_ex3]: ../w1day03_ex3_plot1.png "Scatter plot ex3" |
@ -0,0 +1,18 @@
|
||||
## Exercise 3 Matplotlib 1 |
||||
|
||||
The goal of this exercise is to learn to use Matplotlib to plot data. As you know, Matplotlib is the underlying library used by Pandas. It provides more options to plot custom visualizations. However, most of the plots we will create with Matplotlib can be reproduced with Pandas' `.plot()`. |
||||
|
||||
1. Reproduce this plot. We assume the data points have integer coordinates. |
||||
|
||||
![alt text][logo_ex3] |
||||
|
||||
[logo_ex3]: ./w1day03_ex3_plot1.png "Scatter plot ex3" |
||||
|
||||
The plot has to contain: |
||||
|
||||
- the title |
||||
- name on x-axis and y-axis |
||||
- x-axis and y-axis are limited to [1,8] |
||||
- **style**: |
||||
- red dashdot line with a width of 3 |
||||
- blue circles with a size of 12 |
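A sketch of the styling options (the data points themselves are an assumption; only the target image defines them, as well as the title text):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripts without a display
import matplotlib.pyplot as plt

# Assumed integer data points
x = list(range(1, 8))
y = list(range(1, 8))

fig, ax = plt.subplots()
ax.plot(x, y,
        linestyle='-.', color='red', linewidth=3,    # red dashdot line, width 3
        marker='o', markersize=12,
        markerfacecolor='blue', markeredgecolor='blue')  # blue circles, size 12
ax.set_xlim(1, 8)
ax.set_ylim(1, 8)
ax.set_title('My first plot')  # hypothetical title
ax.set_xlabel('x')
ax.set_ylabel('y')
```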
@ -0,0 +1,12 @@
|
||||
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects those criteria. |
||||
|
||||
###### Does it have a title? |
||||
###### Does it have a name on x-axis and y-axis ? |
||||
###### Is the left data black ? |
||||
###### Is the right data red ? |
||||
|
||||
![alt text][logo_ex4] |
||||
|
||||
[logo_ex4]: ../w1day03_ex4_plot1.png "Twin axis ex4" |
||||
|
||||
https://matplotlib.org/gallery/api/two_scales.html |
@ -0,0 +1,25 @@
|
||||
# Exercise 4 Matplotlib 2 |
||||
|
||||
The goal of this exercise is to learn to use Matplotlib to plot different lines in the same plot on different axes using `twinx`. This is very useful to compare variables with different ranges. |
||||
|
||||
Here is the data: |
||||
|
||||
```python |
||||
left_data = [5, 7, 11, 13, 17] |
||||
right_data = [0.1, 0.2, 0.4, 0.8, -1.6] |
||||
x_axis = [0.0, 1.0, 2.0, 3.0, 4.0] |
||||
``` |
||||
|
||||
1. Reproduce this plot |
||||
|
||||
![alt text][logo_ex4] |
||||
|
||||
[logo_ex4]: ./w1day03_ex4_plot1.png "Twin axis plot ex4" |
||||
|
||||
The plot has to contain: |
||||
|
||||
- the title |
||||
- name on left y-axis and right y-axis |
||||
- **style**: |
||||
- left data in black |
||||
- right data in red |
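A minimal sketch of the `twinx` approach on the data above; the title and axis names are placeholders, not the expected ones.

```python
import matplotlib.pyplot as plt

left_data = [5, 7, 11, 13, 17]
right_data = [0.1, 0.2, 0.4, 0.8, -1.6]
x_axis = [0.0, 1.0, 2.0, 3.0, 4.0]

fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()              # second y-axis sharing the same x-axis

ax_left.plot(x_axis, left_data, color='black')    # left data in black
ax_right.plot(x_axis, right_data, color='red')    # right data in red

ax_left.set_ylabel('left', color='black')
ax_right.set_ylabel('right', color='red')
ax_left.set_title('Title')
plt.show()
```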
After Width: | Height: | Size: 18 KiB |
@ -0,0 +1,11 @@
|
||||
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects these criteria. |
||||
|
||||
###### Does it contain 6 subplots (2 rows, 3 columns)? |
||||
###### Does it have space between plots (`hspace=0.5` and `wspace=0.5`)? |
||||
###### Do all subplots contain a title: `Title i`? |
||||
###### Do all subplots contain a text `(2,3,i)` centered at `(0.5, 0.5)`? *Hint*: check the parameter `ha` of `text` |
||||
###### Have all subplots been created in a for loop? |
||||
|
||||
![alt text][logo_ex5] |
||||
|
||||
[logo_ex5]: ../w1day03_ex5_plot1.png "Subplots ex5" |
@ -0,0 +1,18 @@
|
||||
# Exercise 5 Matplotlib subplots |
||||
|
||||
The goal of this exercise is to learn to use Matplotlib to create subplots. |
||||
|
||||
1. Reproduce this plot using a **for loop**: |
||||
|
||||
![alt text][logo_ex5] |
||||
|
||||
[logo_ex5]: ./w1day03_ex5_plot1.png "Subplots ex5" |
||||
|
||||
The plot has to contain: |
||||
|
||||
- 6 subplots: 2 rows, 3 columns |
||||
- Keep space between plots: `hspace=0.5` and `wspace=0.5` |
||||
- Each plot contains |
||||
|
||||
- Text (2,3,i) centered at 0.5, 0.5. *Hint*: check the parameter `ha` of `text` |
||||
- a title: Title i |
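The requirements above can be sketched with a single for loop over the six subplot positions:

```python
import matplotlib.pyplot as plt

fig = plt.figure()
fig.subplots_adjust(hspace=0.5, wspace=0.5)   # keep space between plots

for i in range(1, 7):                         # 6 subplots: 2 rows, 3 columns
    ax = fig.add_subplot(2, 3, i)
    ax.text(0.5, 0.5, f'(2,3,{i})', ha='center')   # ha centers the text horizontally
    ax.set_title(f'Title {i}')

plt.show()
```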
After Width: | Height: | Size: 13 KiB |
@ -0,0 +1,25 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects these criteria. |
||||
|
||||
###### Does it have a title? |
||||
###### Does it have a name on the x-axis and y-axis? |
||||
|
||||
|
||||
![alt text][logo_ex6] |
||||
|
||||
[logo_ex6]: ../w1day03_ex6_plot1.png "Time series ex6" |
||||
|
||||
|
||||
##### The solution of question 2 is accepted if the plot reproduces the plot in the image by using `plotly.graph_objects` and respects these criteria: |
||||
|
||||
|
||||
|
||||
###### Does it have a title? |
||||
###### Does it have a name on the x-axis and y-axis? |
||||
|
||||
![alt text][logo_ex6] |
||||
|
||||
[logo_ex6]: ../w1day03_ex6_plot1.png "Time series ex6" |
@ -0,0 +1,34 @@
|
||||
# Exercise 6 Plotly 1 |
||||
|
||||
Plotly has evolved a lot in recent years. It is important to **always check the documentation**. |
||||
|
||||
Plotly comes with a high-level interface: Plotly Express. It helps build some complex plots easily. The lesson won't detail the complex examples. Plotly Express is quite interesting when used with Pandas DataFrames because some built-in functions leverage them directly. |
||||
|
||||
The plot output by Plotly is interactive and can also be dynamic. |
||||
|
||||
The goal of the exercise is to plot the price of a company. Its price is generated below. |
||||
|
||||
```python |
||||
returns = np.random.randn(50) |
||||
price = 100 + np.cumsum(returns) |
||||
|
||||
dates = pd.date_range(start='2020-09-01', periods=50, freq='B') |
||||
df = pd.DataFrame(zip(dates, price), |
||||
columns=['Date','Company_A']) |
||||
``` |
||||
|
||||
1. Using **Plotly Express**, reproduce the plot in the image. As the data is generated randomly, I do not expect you to reproduce the same line. |
||||
|
||||
![alt text][logo_ex6] |
||||
|
||||
[logo_ex6]: ./w1day03_ex6_plot1.png "Time series ex6" |
||||
|
||||
The plot has to contain: |
||||
|
||||
- title |
||||
- x-axis name |
||||
- y-axis name |
||||
|
||||
2. Same question but now using `plotly.graph_objects`. You may need to use `init_notebook_mode` from `plotly.offline`. |
||||
|
||||
https://plotly.com/python/time-series/ |
After Width: | Height: | Size: 43 KiB |
@ -0,0 +1,25 @@
|
||||
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects these criteria. The code below shows a solution. |
||||
|
||||
###### Does it have a title? |
||||
###### Does it have a legend? |
||||
|
||||
![alt text][logo_ex7] |
||||
|
||||
[logo_ex7]: ../w1day03_ex7_plot1.png "Box plot ex7" |
||||
|
||||
```python |
||||
import plotly.graph_objects as go |
||||
import numpy as np |
||||
|
||||
y0 = np.random.randn(50) |
||||
y1 = np.random.randn(50) + 1 # shift mean |
||||
y2 = np.random.randn(50) + 2 |
||||
|
||||
fig = go.Figure() |
||||
fig.add_trace(go.Box(y=y0, name='Sample A', |
||||
                     marker_color='indianred')) |
||||
fig.add_trace(go.Box(y=y1, name='Sample B', |
||||
                     marker_color='lightseagreen')) |
||||
|
||||
fig.update_layout(title='Box plots')  # a title is required by the criteria above |
||||
fig.show() |
||||
``` |
@ -0,0 +1,24 @@
|
||||
# Exercise 7 Plotly Box plots |
||||
|
||||
The goal of this exercise is to learn to use Plotly to plot box plots. A box plot is a method for graphically depicting groups of numerical data through their quartiles and values such as the min and max. It allows you to compare several variables quickly. |
||||
|
||||
Let us generate 3 random arrays from a normal distribution, shifting the second and third by 1 and 2 respectively. |
||||
|
||||
```python |
||||
y0 = np.random.randn(50) |
||||
y1 = np.random.randn(50) + 1 # shift mean |
||||
y2 = np.random.randn(50) + 2 |
||||
``` |
||||
|
||||
1. Plot 2 box plots in the same figure, as shown in the image. In this exercise the style is not important. |
||||
|
||||
![alt text][logo_ex7] |
||||
|
||||
[logo_ex7]: ./w1day03_ex7_plot1.png "Box plot ex7" |
||||
|
||||
The plot has to contain: |
||||
|
||||
- the title |
||||
- the legend |
||||
|
||||
https://plotly.com/python/box-plots/ |
After Width: | Height: | Size: 13 KiB |
@ -0,0 +1,47 @@
|
||||
# W1D03 Piscine AI - Data Science |
||||
|
||||
## Visualizations |
||||
|
||||
While working on a dataset it is important to check the distribution of the data. Obviously, for most humans it is difficult to visualize data in more than 3 dimensions. |
||||
|
||||
"Viz" is important to understand the data and to show results. We'll discover three libraries to visualize data in Python. These are one of the most used visualisation "libraries" in Python: |
||||
|
||||
- Pandas visualization module |
||||
- Matplotlib |
||||
- Plotly |
||||
|
||||
The goal is to understand the basics of those libraries. You'll have time during the project to master one (or all three) of them. |
||||
You may wonder why using one library is not enough. The reason is simple: it depends on the usage. |
||||
For example, if you want to check the data quickly you may want to use the Pandas viz module or Matplotlib. |
||||
If you want to plot a custom and more elaborate plot, I suggest using Matplotlib or Plotly. |
||||
And, if you want to create a very nice and interactive plot, I suggest using Plotly. |
||||
|
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercise 1 Pandas plot 1 |
||||
- Exercise 2 Pandas plot 2 |
||||
- Exercise 3 Matplotlib 1 |
||||
- Exercise 4 Matplotlib 2 |
||||
- Exercise 5 Matplotlib subplots |
||||
- Exercise 6 Plotly 1 |
||||
- Exercise 7 Plotly Box plots |
||||
|
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Matplotlib |
||||
- Plotly |
||||
- Jupyter or JupyterLab |
||||
|
||||
I suggest using the most recent version of the packages. |
||||
|
||||
## Resources |
||||
|
||||
- https://matplotlib.org/3.3.3/tutorials/index.html |
||||
- https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596 |
||||
|
||||
- https://github.com/rougier/matplotlib-tutorial |
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html |
@ -0,0 +1,9 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### Activate the virtual environment. If you used `conda` run `conda activate your_env` |
||||
|
||||
##### Run `python --version` |
||||
|
||||
###### Does it print `Python 3.x`? x >= 8 |
||||
|
||||
##### Does `import jupyter`, `import numpy` and `import pandas` run without any error? |
@ -0,0 +1,55 @@
|
||||
# W1D04 Piscine AI - Data Science |
||||
|
||||
## Data wrangling with Pandas |
||||
|
||||
Data wrangling is one of the crucial tasks in data science and analysis which includes operations like: |
||||
|
||||
- Data Sorting: To rearrange values in ascending or descending order. |
||||
- Data Filtration: To create a subset of available data. |
||||
- Data Reduction: To eliminate or replace unwanted values. |
||||
- Data Access: To read or write data files. |
||||
- Data Processing: To perform aggregation, statistical, and similar operations on specific values. |
||||
As explained before, Pandas is an open source library specifically developed for data science and analysis. It is built upon the NumPy package (to handle numeric data in tabular form) and has inbuilt data structures to ease up the process of data manipulation, aka data munging/wrangling. |
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercise 1 Concatenate |
||||
- Exercise 2 Merge |
||||
- Exercise 3 Merge MultiIndex |
||||
- Exercise 4 Groupby Apply |
||||
- Exercise 5 Groupby Agg |
||||
- Exercise 6 Unstack |
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Jupyter or JupyterLab |
||||
|
||||
*Version of Pandas I used to do the exercises: 1.0.1*. |
||||
I suggest using the most recent one. |
||||
|
||||
## Resources |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/ |
||||
|
||||
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf |
||||
|
||||
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/ |
||||
|
||||
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe |
||||
|
||||
|
||||
# Exercise 0 Environment and libraries |
||||
|
||||
The goal of this exercise is to set up the Python work environment with the required libraries. |
||||
|
||||
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries. |
||||
|
||||
I recommend using: |
||||
|
||||
- the **latest stable version** of Python |
||||
- the virtual environment tool you're most comfortable with; `virtualenv` and `conda` are the most used in Data Science |
||||
- one of the most recent versions of the required libraries |
||||
|
||||
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`. |
@ -0,0 +1,8 @@
|
||||
##### This question is validated if the output DataFrame is: |
||||
|
||||
| | letter | number | |
||||
|---:|:---------|---------:| |
||||
| 0 | a | 1 | |
||||
| 1 | b | 2 | |
||||
| 2 | c | 1 | |
||||
| 3 | d | 2 | |
@ -0,0 +1,14 @@
|
||||
# Exercise 1 Concatenate |
||||
|
||||
The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for Series. |
||||
|
||||
Here are the two DataFrames to concatenate: |
||||
|
||||
```python |
||||
df1 = pd.DataFrame([['a', 1], ['b', 2]], |
||||
columns=['letter', 'number']) |
||||
df2 = pd.DataFrame([['c', 1], ['d', 2]], |
||||
columns=['letter', 'number']) |
||||
``` |
||||
|
||||
1. Concatenate these two DataFrames along the index axis and reset the index. The index of the output DataFrame should be `RangeIndex(start=0, stop=4, step=1)`. **Do not change the index manually**. |
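One possible approach, sketched on the two DataFrames above:

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]], columns=['letter', 'number'])

# ignore_index=True discards the original indexes and builds a fresh RangeIndex,
# so no manual index change is needed
result = pd.concat([df1, df2], ignore_index=True)
print(result.index)
```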
@ -0,0 +1,23 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### The question 1 is validated if the output is: |
||||
|
||||
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
||||
|---:|-----:|:-------------|:-------------|:-------------|:-------------| |
||||
| 0 | 1 | A | B | K | L | |
||||
| 1 | 2 | C | D | M | N | |
||||
|
||||
##### The question 2 is validated if the output is: |
||||
|
||||
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 | |
||||
|---:|-----:|:---------------|:---------------|:---------------|:---------------| |
||||
| 0 | 1 | A | B | K | L | |
||||
| 1 | 2 | C | D | M | N | |
||||
| 2 | 3 | E | F | nan | nan | |
||||
| 3 | 4 | G | H | nan | nan | |
||||
| 4 | 5 | I | J | nan | nan | |
||||
| 5 | 6 | nan | nan | O | P | |
||||
| 6 | 7 | nan | nan | Q | R | |
||||
| 7 | 8 | nan | nan | S | T | |
||||
|
||||
Note: Check that the suffixes are set using the suffix parameters rather than by manually changing the columns' names. |
@ -0,0 +1,46 @@
|
||||
# Exercise 2 Merge |
||||
|
||||
The goal of this exercise is to learn to merge DataFrames. |
||||
The logic of merging DataFrames in Pandas is quite similar to the one used in SQL. |
||||
|
||||
Here are the two DataFrames to merge: |
||||
|
||||
```python |
||||
#df1 |
||||
|
||||
df1_dict = { |
||||
'id': ['1', '2', '3', '4', '5'], |
||||
'Feature1': ['A', 'C', 'E', 'G', 'I'], |
||||
'Feature2': ['B', 'D', 'F', 'H', 'J']} |
||||
|
||||
df1 = pd.DataFrame(df1_dict, columns = ['id', 'Feature1', 'Feature2']) |
||||
|
||||
#df2 |
||||
df2_dict = { |
||||
'id': ['1', '2', '6', '7', '8'], |
||||
'Feature1': ['K', 'M', 'O', 'Q', 'S'], |
||||
'Feature2': ['L', 'N', 'P', 'R', 'T']} |
||||
|
||||
df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2']) |
||||
``` |
||||
|
||||
1. Merge the two DataFrames to get this output: |
||||
|
||||
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | |
||||
|---:|-----:|:-------------|:-------------|:-------------|:-------------| |
||||
| 0 | 1 | A | B | K | L | |
||||
| 1 | 2 | C | D | M | N | |
||||
|
||||
2. Merge the two DataFrames to get this output: |
||||
|
||||
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 | |
||||
|---:|-----:|:---------------|:---------------|:---------------|:---------------| |
||||
| 0 | 1 | A | B | K | L | |
||||
| 1 | 2 | C | D | M | N | |
||||
| 2 | 3 | E | F | nan | nan | |
||||
| 3 | 4 | G | H | nan | nan | |
||||
| 4 | 5 | I | J | nan | nan | |
||||
| 5 | 6 | nan | nan | O | P | |
||||
| 6 | 7 | nan | nan | Q | R | |
||||
| 7 | 8 | nan | nan | S | T | |
||||
|
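Both questions can be sketched with `merge` on the `id` column; the suffix values for question 2 are taken from the expected output above.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['1', '2', '3', '4', '5'],
                    'Feature1': ['A', 'C', 'E', 'G', 'I'],
                    'Feature2': ['B', 'D', 'F', 'H', 'J']})
df2 = pd.DataFrame({'id': ['1', '2', '6', '7', '8'],
                    'Feature1': ['K', 'M', 'O', 'Q', 'S'],
                    'Feature2': ['L', 'N', 'P', 'R', 'T']})

# question 1: the default inner join keeps only the ids present in both DataFrames
inner = df1.merge(df2, on='id')

# question 2: an outer join keeps every id; suffixes renames the overlapping columns
outer = df1.merge(df2, on='id', how='outer', suffixes=('_df1', '_df2'))
```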
@ -0,0 +1,14 @@
|
||||
##### The exercice is validated is all questions of the exercice are validated |
||||
|
||||
##### The question 1 is validated if the output DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a table as below. One of the answers that returns the correct DataFrame is `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)` |
||||
|
||||
| | Open | Close | Close_Adjusted | Twitter | Reddit | |
||||
|:-----------------------------------------------------|-----------:|----------:|-----------------:|------------:|----------:| |
||||
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AAPL') | 0.0991792 | -0.31603 | 0.634787 | -0.00159041 | 1.06053 | |
||||
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'FB') | -0.123753 | 1.00269 | 0.713264 | 0.0142127 | -0.487028 | |
||||
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'GE') | -1.37775 | -1.01504 | 1.2858 | 0.109835 | 0.04273 | |
||||
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 | |
||||
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 | |
||||
|
||||
|
||||
##### The question 2 is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True` |
@ -0,0 +1,34 @@
|
||||
# Exercise 3 Merge MultiIndex |
||||
|
||||
The goal of this exercise is to learn to merge DataFrames with MultiIndex. |
||||
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reason, the Data Engineer lost the last 15 days of alternative data. |
||||
|
||||
1. Using `market_data` as the reference, merge `alternative_data` on `market_data` |
||||
|
||||
```python |
||||
#generate days |
||||
all_dates = pd.date_range('2021-01-01', '2021-12-15') |
||||
business_dates = pd.bdate_range('2021-01-01', '2021-12-31') |
||||
|
||||
#generate tickers |
||||
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI'] |
||||
|
||||
#create indexes |
||||
index_alt = pd.MultiIndex.from_product([all_dates, tickers], names=['Date', 'Ticker']) |
||||
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker']) |
||||
|
||||
# create DFs |
||||
market_data = pd.DataFrame(index=index, |
||||
data=np.random.randn(len(index), 3), |
||||
columns=['Open','Close','Close_Adjusted']) |
||||
|
||||
alternative_data = pd.DataFrame(index=index_alt, |
||||
data=np.random.randn(len(index_alt), 2), |
||||
columns=['Twitter','Reddit']) |
||||
``` |
||||
|
||||
`reset_index` is not allowed for this question |
||||
|
||||
2. Fill missing values with 0 |
||||
|
||||
- https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d |
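The two questions can be sketched as follows, reusing the generation code above; the left join on the MultiIndex avoids `reset_index` entirely.

```python
import numpy as np
import pandas as pd

all_dates = pd.date_range('2021-01-01', '2021-12-15')
business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']

index_alt = pd.MultiIndex.from_product([all_dates, tickers], names=['Date', 'Ticker'])
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])

market_data = pd.DataFrame(index=index,
                           data=np.random.randn(len(index), 3),
                           columns=['Open', 'Close', 'Close_Adjusted'])
alternative_data = pd.DataFrame(index=index_alt,
                                data=np.random.randn(len(index_alt), 2),
                                columns=['Twitter', 'Reddit'])

# question 1: a left join on the MultiIndex keeps every business day of market_data
merged = market_data.merge(alternative_data, how='left',
                           left_index=True, right_index=True)

# question 2: the last 15 days of alternative data are missing, fill them with 0
filled = merged.fillna(0)
```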
@ -0,0 +1,56 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated and if a for loop hasn't been used. The goal is to use `groupby` and `apply`. |
||||
|
||||
##### The question 1 is validated if the output is: |
||||
|
||||
```python |
||||
df = pd.DataFrame(range(1,11), columns=['sequence']) |
||||
print(winsorize(df, [0.20, 0.80]).to_markdown()) |
||||
``` |
||||
|
||||
| | sequence | |
||||
|---:|-----------:| |
||||
| 0 | 2.8 | |
||||
| 1 | 2.8 | |
||||
| 2 | 3 | |
||||
| 3 | 4 | |
||||
| 4 | 5 | |
||||
| 5 | 6 | |
||||
| 6 | 7 | |
||||
| 7 | 8 | |
||||
| 8 | 8.2 | |
||||
| 9 | 8.2 | |
||||
|
||||
##### The question 2 is validated if the output is a Pandas Series or DataFrame with the first 11 rows equal to the output below. The code below gives a solution. |
||||
|
||||
| | sequence | |
||||
|---:|-----------:| |
||||
| 0 | 1.45 | |
||||
| 1 | 2 | |
||||
| 2 | 3 | |
||||
| 3 | 4 | |
||||
| 4 | 5 | |
||||
| 5 | 6 | |
||||
| 6 | 7 | |
||||
| 7 | 8 | |
||||
| 8 | 9 | |
||||
| 9 | 9.55 | |
||||
| 10 | 11.45 | |
||||
|
||||
|
||||
```python |
||||
import numpy as np |
||||
|
||||
def winsorize(df_series, quantiles): |
||||
""" |
||||
df: pd.DataFrame or pd.Series |
||||
quantiles: list [0.05, 0.95] |
||||
|
||||
""" |
||||
min_value = np.quantile(df_series, quantiles[0]) |
||||
max_value = np.quantile(df_series, quantiles[1]) |
||||
|
||||
return df_series.clip(lower = min_value, upper = max_value) |
||||
|
||||
|
||||
df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95]) |
||||
``` |
||||
|
||||
- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e |
@ -0,0 +1,65 @@
|
||||
# Exercise 4 Groupby Apply |
||||
|
||||
The goal of this exercise is to learn to group the data and apply a function on the groups. |
||||
The use case we will work on is computing winsorized values on grouped data. |
||||
|
||||
1. Create a function that uses `pandas.DataFrame.clip` and that replaces extreme values by a given percentile. The values that are greater than the upper percentile (80%) are replaced by the 80% percentile value. The values that are smaller than the lower percentile (20%) are replaced by the 20% percentile value. This process, which corrects outliers, is called **winsorizing**. |
||||
I recommend using NumPy to compute the percentiles to make sure we use the same default parameters. |
||||
|
||||
```python |
||||
def winsorize(df, quantiles): |
||||
""" |
||||
df: pd.DataFrame |
||||
quantiles: list |
||||
ex: [0.05, 0.95] |
||||
""" |
||||
#TODO |
||||
return |
||||
``` |
||||
|
||||
Here is what the function should output: |
||||
|
||||
```python |
||||
df = pd.DataFrame(range(1,11), columns=['sequence']) |
||||
print(winsorize(df, [0.20, 0.80]).to_markdown()) |
||||
|
||||
``` |
||||
|
||||
| | sequence | |
||||
|---:|-----------:| |
||||
| 0 | 2.8 | |
||||
| 1 | 2.8 | |
||||
| 2 | 3 | |
||||
| 3 | 4 | |
||||
| 4 | 5 | |
||||
| 5 | 6 | |
||||
| 6 | 7 | |
||||
| 7 | 8 | |
||||
| 8 | 8.2 | |
||||
| 9 | 8.2 | |
||||
|
||||
2. Now we consider that each value belongs to a group. The goal is to apply the **winsorizing to each group**. In this question we use common winsorizing percentiles: `[0.05, 0.95]`. Here is the new data set: |
||||
|
||||
```python |
||||
groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4]) |
||||
|
||||
df = pd.DataFrame(data= zip(groups, |
||||
range(1,51)), |
||||
columns=["group", "sequence"]) |
||||
``` |
||||
|
||||
The expected output (first rows) is: |
||||
|
||||
| | sequence | |
||||
|---:|-----------:| |
||||
| 0 | 1.45 | |
||||
| 1 | 2 | |
||||
| 2 | 3 | |
||||
| 3 | 4 | |
||||
| 4 | 5 | |
||||
| 5 | 6 | |
||||
| 6 | 7 | |
||||
| 7 | 8 | |
||||
| 8 | 9 | |
||||
| 9 | 9.55 | |
||||
| 10 | 11.45 | |
@ -0,0 +1,8 @@
|
||||
##### The question is validated if the output is as below. The columns don't have to be MultiIndex. A solution could be `df.groupby('product').agg({'value':['min','max','mean']})` |
||||
|
||||
|
||||
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') | |
||||
|:-------------|-------------------:|-------------------:|--------------------:| |
||||
| chair | 22.89 | 32.12 | 27.505 | |
||||
| mobile phone | 100 | 111.22 | 105.61 | |
||||
| table | 20.45 | 99.99 | 51.22 | |
@ -0,0 +1,23 @@
|
||||
# Exercise 5 Groupby Agg |
||||
|
||||
The goal of this exercise is to learn to compute different types of aggregations on groups. This small DataFrame contains products and prices. |
||||
|
||||
| | value | product | |
||||
|---:|--------:|:-------------| |
||||
| 0 | 20.45 | table | |
||||
| 1 | 22.89 | chair | |
||||
| 2 | 32.12 | chair | |
||||
| 3 | 111.22 | mobile phone | |
||||
| 4 | 33.22 | table | |
||||
| 5 | 100 | mobile phone | |
||||
| 6 | 99.99 | table | |
||||
|
||||
1. Compute the min, max and mean price for each product in one single line of code. The expected output is: |
||||
|
||||
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') | |
||||
|:-------------|-------------------:|-------------------:|--------------------:| |
||||
| chair | 22.89 | 32.12 | 27.505 | |
||||
| mobile phone | 100 | 111.22 | 105.61 | |
||||
| table | 20.45 | 99.99 | 51.22 | |
||||
|
||||
Note: The columns don't have to be MultiIndex |
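The single-line aggregation can be sketched as follows, rebuilding the small DataFrame shown above:

```python
import pandas as pd

df = pd.DataFrame({'value': [20.45, 22.89, 32.12, 111.22, 33.22, 100, 99.99],
                   'product': ['table', 'chair', 'chair', 'mobile phone',
                               'table', 'mobile phone', 'table']})

# one aggregation dict, three statistics, one line of code
stats = df.groupby('product').agg({'value': ['min', 'max', 'mean']})
print(stats)
```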
@ -0,0 +1,32 @@
|
||||
# Exercise 6 Unstack |
||||
|
||||
The goal of this exercise is to learn to unstack a MultiIndex |
||||
Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, ... |
||||
|
||||
```python |
||||
business_dates = pd.bdate_range('2021-01-01', '2021-12-31') |
||||
|
||||
#generate tickers |
||||
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI'] |
||||
|
||||
#create indexes |
||||
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker']) |
||||
|
||||
# create DFs |
||||
market_data = pd.DataFrame(index=index, |
||||
data=np.random.randn(len(index), 1), |
||||
columns=['Prediction']) |
||||
|
||||
``` |
||||
|
||||
1. Unstack the DataFrame. |
||||
|
||||
The first 3 rows of the DataFrame should look like this: |
||||
|
||||
| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') | |
||||
|:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:| |
||||
| 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 | |
||||
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 | |
||||
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 | |
||||
|
||||
2. Plot the 5 time series in the same plot using Pandas built-in visualization functions, with a title. |
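The unstack step can be sketched as follows, reusing the generation code above:

```python
import numpy as np
import pandas as pd

business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker'])

market_data = pd.DataFrame(index=index,
                           data=np.random.randn(len(index), 1),
                           columns=['Prediction'])

# unstack moves the inner index level (Ticker) into the columns
unstacked = market_data.unstack()
# unstacked.plot(title='Predictions')   # question 2: Pandas built-in plotting
```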
@ -0,0 +1,40 @@
|
||||
# W1D04 Piscine AI - Data Science |
||||
|
||||
## Data wrangling with Pandas |
||||
|
||||
Data wrangling is one of the crucial tasks in data science and analysis which includes operations like: |
||||
|
||||
- Data Sorting: To rearrange values in ascending or descending order. |
||||
- Data Filtration: To create a subset of available data. |
||||
- Data Reduction: To eliminate or replace unwanted values. |
||||
- Data Access: To read or write data files. |
||||
- Data Processing: To perform aggregation, statistical, and similar operations on specific values. |
||||
As explained before, Pandas is an open source library specifically developed for data science and analysis. It is built upon the NumPy package (to handle numeric data in tabular form) and has inbuilt data structures to ease up the process of data manipulation, aka data munging/wrangling. |
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercise 1 Concatenate |
||||
- Exercise 2 Merge |
||||
- Exercise 3 Merge MultiIndex |
||||
- Exercise 4 Groupby Apply |
||||
- Exercise 5 Groupby Agg |
||||
- Exercise 6 Unstack |
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Jupyter or JupyterLab |
||||
|
||||
*Version of Pandas I used to do the exercises: 1.0.1*. |
||||
I suggest using the most recent one. |
||||
|
||||
## Resources |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/ |
||||
|
||||
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf |
||||
|
||||
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/ |
||||
|
||||
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe |
@ -0,0 +1,9 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### Activate the virtual environment. If you used `conda` run `conda activate your_env` |
||||
|
||||
##### Run `python --version` |
||||
|
||||
###### Does it print `Python 3.x`? x >= 8 |
||||
|
||||
##### Does `import jupyter`, `import numpy` and `import pandas` run without any error? |
@ -0,0 +1,52 @@
|
||||
# W1D05 Piscine AI - Data Science |
||||
|
||||
## Time Series with Pandas |
||||
|
||||
Time series data are data that are indexed by a sequence of dates or times. Today, you'll learn how to use methods built into Pandas to work with this index. You'll also learn, for instance: |
||||
- to resample time series to change the frequency |
||||
- to calculate rolling and cumulative values for times series |
||||
- to build a backtest |
||||
|
||||
Time series are used A LOT in finance. You'll learn to evaluate financial strategies using Pandas. It is important to keep in mind that Pandas is vectorized. That's why some questions constrain you not to use a for loop ;-). |
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercise 1 Series |
||||
- Exercise 2 Financial data |
||||
- Exercise 3 Multi asset returns |
||||
- Exercise 4 Backtest |
||||
|
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Jupyter or JupyterLab |
||||
|
||||
*Version of Pandas I used to do the exercises: 1.0.1*. |
||||
I suggest using the most recent one. |
||||
|
||||
## Resources |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/ |
||||
|
||||
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf |
||||
|
||||
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/ |
||||
|
||||
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe |
||||
|
||||
|
||||
# Exercise 0 Environment and libraries |
||||
|
||||
The goal of this exercise is to set up the Python work environment with the required libraries. |
||||
|
||||
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries. |
||||
|
||||
I recommend using: |
||||
|
||||
- the **latest stable version** of Python |
||||
- the virtual environment tool you're most comfortable with; `virtualenv` and `conda` are the most used in Data Science |
||||
- one of the most recent versions of the required libraries |
||||
|
||||
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`. |
@ -0,0 +1,35 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### The question 1 is validated if the output is as below. The best solution uses `pd.date_range` to generate the index and `range` to generate the integer series. |
||||
|
||||
```console |
||||
2010-01-01 0 |
||||
2010-01-02 1 |
||||
2010-01-03 2 |
||||
2010-01-04 3 |
||||
2010-01-05 4 |
||||
... |
||||
2020-12-27 4013 |
||||
2020-12-28 4014 |
||||
2020-12-29 4015 |
||||
2020-12-30 4016 |
||||
2020-12-31 4017 |
||||
Freq: D, Name: integer_series, Length: 4018, dtype: int64 |
||||
``` |
||||
|
||||
##### The question 2 is validated if the output is as below. If the `NaN` values have been dropped the solution is also accepted. The solution uses `rolling().mean()`. |
||||
|
||||
```console |
||||
2010-01-01 NaN |
||||
2010-01-02 NaN |
||||
2010-01-03 NaN |
||||
2010-01-04 NaN |
||||
2010-01-05 NaN |
||||
... |
||||
2020-12-27 4010.0 |
||||
2020-12-28 4011.0 |
||||
2020-12-29 4012.0 |
||||
2020-12-30 4013.0 |
||||
2020-12-31 4014.0 |
||||
Freq: D, Name: integer_series, Length: 4018, dtype: float64 |
||||
``` |
@ -0,0 +1,7 @@
|
||||
# Exercise 1 Series |
||||
|
||||
The goal of this exercise is to learn to manipulate time series in Pandas. |
||||
|
||||
1. Create a `Series` named `integer_series` from 1st January 2010 to 31 December 2020. Each date is associated with the number of days elapsed since 1st January 2010, starting at 0. |
||||
|
||||
2. Using Pandas, compute a 7-day moving average **without a for loop**. This transformation smooths the time series by removing small fluctuations. |
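Both questions can be sketched in a few vectorized lines:

```python
import pandas as pd

# question 1: one row per calendar day, values counting up from 0
index = pd.date_range('2010-01-01', '2020-12-31', freq='D')
integer_series = pd.Series(range(len(index)), index=index, name='integer_series')

# question 2: a vectorized 7-day moving average, no for loop
smoothed = integer_series.rolling(7).mean()
```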
@ -0,0 +1,43 @@
|
||||
##### The exercice is validated is all questions of the exercice are validated |
||||
|
||||
###### Have you checked missing values and data types? |
||||
###### Have you converted string dates to datetime? |
||||
###### Have you set the dates as index? |
||||
###### Have you used `info` or `describe` to have a first look at the data? |
||||
|
||||
|
||||
##### The question 1 is validated if you inserted the right columns in the `Candlestick` Plotly object. The Candlestick is based on the Open, High, Low and Close columns. The index is Date (datetime). |
||||
|
||||
##### The question 2 is validated if the output of `print(transformed_df.head().to_markdown())` is as below and if there are **482 months**. |
||||
|
||||
| Date | Open | Close | Volume | High | Low | |
||||
|:--------------------|---------:|---------:|------------:|---------:|---------:| |
||||
| 1980-12-31 00:00:00 | 0.136075 | 0.135903 | 1.34485e+09 | 0.161272 | 0.112723 | |
||||
| 1981-01-30 00:00:00 | 0.141768 | 0.141316 | 6.08989e+08 | 0.155134 | 0.126116 | |
||||
| 1981-02-27 00:00:00 | 0.118215 | 0.117892 | 3.21619e+08 | 0.128906 | 0.106027 | |
||||
| 1981-03-31 00:00:00 | 0.111328 | 0.110871 | 7.00717e+08 | 0.120536 | 0.09654 | |
||||
| 1981-04-30 00:00:00 | 0.121811 | 0.121545 | 5.36928e+08 | 0.131138 | 0.108259 | |
||||
|
||||
To get this result there are two ways: `resample` and `groupby`. There are two key steps: |
||||
|
||||
- Find how to apply the aggregation on the last **business** day of each month. This is already implemented in Pandas: the keyword to use, either as a `resample` parameter or in `Grouper`, is `BM`. |
||||
- Choose the right aggregation function for each variable. The prices (Open, Close and Adjusted Close) should be aggregated by taking the `mean`. Low should be aggregated by taking the `min` because it represents the lowest price of the day: the lowest price of the month is the minimum of the daily lows. The same logic applied to High leads to using the `max`. Volume should be aggregated using the `sum` because the monthly volume is equal to the sum of the daily volumes over the month. |
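The aggregation described above can be sketched on a toy OHLCV frame (the values are made up; the real data is the Apple stock):

```python
import pandas as pd

# Toy daily data standing in for the Apple OHLCV frame,
# indexed by business days of a single month
df = pd.DataFrame(
    {'Open': [1.0, 2.0, 3.0],
     'Close': [1.1, 2.1, 3.1],
     'High': [1.2, 2.2, 3.2],
     'Low': [0.9, 1.9, 2.9],
     'Volume': [100, 200, 300]},
    index=pd.date_range('1981-01-28', periods=3, freq='B'))

# Resample to the last business day of each month ('BM'),
# with one aggregation function per variable
monthly = df.resample('BM').agg(
    {'Open': 'mean', 'Close': 'mean', 'High': 'max', 'Low': 'min', 'Volume': 'sum'})
```

Note that `agg` takes a dictionary mapping each column to its own aggregation function, which is exactly what "the aggregation should consider the meaning of the variables" requires.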
||||
|
||||
|
||||
##### The question 3 is validated if it doesn't involve a for loop and the output is as below. The first way to compute the return without a for loop is to use `pct_change`. The second way is to implement the formula given in the exercise in a vectorized way: to get the value at `t-1` you can use `shift`. |
||||
|
||||
```console |
||||
Date |
||||
1980-12-12 NaN |
||||
1980-12-15 -0.047823 |
||||
1980-12-16 -0.073063 |
||||
1980-12-17 0.019703 |
||||
1980-12-18 0.028992 |
||||
... |
||||
2021-01-25 0.049824 |
||||
2021-01-26 0.003704 |
||||
2021-01-27 -0.001184 |
||||
2021-01-28 -0.027261 |
||||
2021-01-29 -0.026448 |
||||
Name: Open, Length: 10118, dtype: float64 |
||||
``` |
@ -0,0 +1,16 @@
|
||||
# Exercise 2 |
||||
|
||||
The goal of this exercise is to learn to use Pandas on Time Series and on Financial data. |
||||
|
||||
The data we will use is Apple stock. |
||||
|
||||
1. Using `Plotly`, plot a Candlestick chart |
||||
|
||||
2. Aggregate the data to the **last business day of each month**. The aggregation should consider the meaning of the variables. How many months are in the considered period ? |
||||
|
||||
3. When comparing many stocks, the metric frequently used is the return of the price. The price itself is not a convenient metric as prices evolve in different ranges. The return at time t is defined as |
||||
|
||||
- (Price(t) - Price(t-1)) / Price(t-1) |
||||
|
||||
Using the open price, compute the **daily return**. Propose two different ways, **without a for loop**. |
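The two vectorized approaches can be sketched on a toy price series (the values are made up; the real exercise uses the Apple open prices):

```python
import pandas as pd

# Toy series standing in for the Apple open prices
open_price = pd.Series([10.0, 11.0, 9.9],
                       index=pd.date_range('2021-01-04', periods=3, freq='B'))

# Way 1: built-in percentage change
returns_1 = open_price.pct_change()

# Way 2: the formula implemented with shift to access Price(t-1)
returns_2 = (open_price - open_price.shift(1)) / open_price.shift(1)
```

Both produce a series whose first value is `NaN`, since there is no previous price for the first date.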
||||
|
@ -0,0 +1,6 @@
|
||||
##### This question is validated if, without having used a for loop, the outputted DataFrame's shape is `(261, 5)` and your output is the same as the one returned by this line of code. The DataFrame contains random data, so make sure your output and the one returned by this code are based on the same DataFrame. |
||||
|
||||
```python |
||||
market_data.loc[market_data.index.get_level_values('Ticker')=='AAPL'].sort_index().pct_change() |
||||
``` |
||||
|
@ -0,0 +1,31 @@
|
||||
# Exercise 3 Multi asset returns |
||||
|
||||
The goal of this exercise is to learn to compute daily returns on a DataFrame that contains many assets (multi-assets). |
||||
|
||||
```python |
||||
business_dates = pd.bdate_range('2021-01-01', '2021-12-31') |
||||
|
||||
# generate tickers |
||||
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI'] |
||||
|
||||
# create the MultiIndex |
||||
index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker']) |
||||
|
||||
# create the DataFrame |
||||
market_data = pd.DataFrame(index=index, |
||||
data=np.random.randn(len(index), 1), |
||||
columns=['Price']) |
||||
``` |
||||
|
||||
1. **Without using a for loop**, compute the daily returns (return(d) = (price(d) - price(d-1))/price(d-1)) for all the companies and return a DataFrame like: |
||||
|
||||
| Date | ('Price', 'AAPL') | ('Price', 'AMZN') | ('Price', 'DAI') | ('Price', 'FB') | ('Price', 'GE') | |
||||
|:--------------------|--------------------:|--------------------:|-------------------:|------------------:|------------------:| |
||||
| 2021-01-01 00:00:00 | nan | nan | nan | nan | nan | |
||||
| 2021-01-04 00:00:00 | 1.01793 | 0.0512955 | 3.84709 | -0.503488 | 0.33529 | |
||||
| 2021-01-05 00:00:00 | -0.222884 | -1.64623 | -0.71817 | -5.5036 | -4.15882 | |
||||
|
||||
Note: The data is generated randomly, so the values you get will differ. But this shows the expected DataFrame structure. |
||||
|
||||
Hint: use `groupby`. |
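A sketch of one possible approach following the `groupby` hint (the exact method is up to you):

```python
import numpy as np
import pandas as pd

business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI']
index = pd.MultiIndex.from_product([business_dates, tickers],
                                   names=['Date', 'Ticker'])
market_data = pd.DataFrame(index=index,
                           data=np.random.randn(len(index), 1),
                           columns=['Price'])

# Group by ticker so each company's returns are computed independently,
# then pivot the tickers into columns
returns = (market_data
           .groupby(level='Ticker')
           .pct_change()
           .unstack('Ticker'))
```

Grouping by `Ticker` is what prevents `pct_change` from mixing the last price of one company with the first price of the next one.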
||||
|
@ -0,0 +1,62 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
###### Have you checked missing values and data types ? |
||||
###### Have you converted string dates to datetime ? |
||||
###### Have you set dates as index ? |
||||
###### Have you used `info` or `describe` to have a first look at the data ? |
||||
|
||||
|
||||
|
||||
**My results can be reproduced using `np.random.seed(2712)`. Given the versions of NumPy used, I do not guarantee the reproducibility of the results - that is why I also explain the steps to get to the solution.** |
||||
|
||||
##### The question 1 is validated if the return is computed as: Return(t) = (Price(t+1) - Price(t))/Price(t) and returns this output. Note that if the index is not sorted in ascending order the future return computed is wrong. The answer is also accepted if the return is computed as in exercise 2 and then shifted into the future using `shift`, but I do not recommend this implementation as it adds missing values ! |
||||
|
||||
```console |
||||
Date |
||||
1980-12-12 -0.052170 |
||||
1980-12-15 -0.073403 |
||||
1980-12-16 0.024750 |
||||
1980-12-17 0.029000 |
||||
1980-12-18 0.061024 |
||||
... |
||||
2021-01-25 0.001679 |
||||
2021-01-26 -0.007684 |
||||
2021-01-27 -0.034985 |
||||
2021-01-28 -0.037421 |
||||
2021-01-29 NaN |
||||
Name: Daily_futur_returns, Length: 10118, dtype: float64 |
||||
``` |
||||
|
||||
An example of solution is: |
||||
|
||||
```python |
||||
def compute_futur_return(price): |
||||
return (price.shift(-1) - price)/price |
||||
|
||||
compute_futur_return(df['Adj Close']) |
||||
``` |
||||
|
||||
|
||||
##### The question 2 is validated if the index of the Series is the same as the index of the DataFrame. The data of the series can be generated using `np.random.randint(0, 2, len(df.index))`. |
||||
|
||||
##### The question 3 is validated if the PnL is computed as: signal * future_return. Both series should have the same index. |
||||
|
||||
```console |
||||
Date |
||||
1980-12-12 -0.052170 |
||||
1980-12-15 -0.073403 |
||||
1980-12-16 0.024750 |
||||
1980-12-17 0.029000 |
||||
1980-12-18 0.061024 |
||||
... |
||||
2021-01-25 0.001679 |
||||
2021-01-26 -0.007684 |
||||
2021-01-27 -0.034985 |
||||
2021-01-28 -0.037421 |
||||
2021-01-29 NaN |
||||
Name: PnL, Length: 10119, dtype: float64 |
||||
``` |
||||
|
||||
##### The question 4 is validated if you computed the return of the strategy as: `(Total earned - Total invested) / Total invested`. The result should be close to 0. The formula given can be simplified as `PnLs.sum() / signal.sum()`. My return is 0.00043546984088551553 because I invested 5147$ and earned 5149$. |
||||
|
||||
##### The question 5 is validated if you replaced the previous signal Series with 1s. Similarly to the previous question, we earned 10128$ and invested 10118$, which leads to a return of 0.00112670194140969 (0.1%). |
@ -0,0 +1,44 @@
|
||||
# Exercise 4 Backtest |
||||
|
||||
The goal of this exercise is to learn to perform a backtest in Pandas. A backtest is a tool that allows you to know how a strategy would have performed retrospectively using historical data. In this exercise we will focus on the backtesting tool and not on how to build the best strategy. |
||||
|
||||
We will backtest a **long only** strategy on Apple Inc. Long only means that we only consider buying the stock. The input signal at date d says if the close price will increase at d+1. We assume that the input signal is available before the market closes. |
||||
|
||||
1. Drop the rows with missing values and compute the daily future return of the Apple stock on the adjusted close price. The daily future return means: **Return(t) = (Price(t+1) - Price(t))/Price(t)**. |
||||
Some events, such as splits or dividends, artificially change the price of the stock. That is why the close price is adjusted: to avoid having outliers in the price data. |
||||
|
||||
2. Create a Series that contains a random boolean array with **p=0.5** |
||||
|
||||
```console |
||||
Here is an example of the expected time series: |
||||
2010-01-01 1 |
||||
2010-01-02 0 |
||||
2010-01-03 0 |
||||
2010-01-04 1 |
||||
2010-01-05 0 |
||||
Freq: D, Name: long_only_signal, dtype: int64 |
||||
``` |
||||
|
||||
- The information in this series should be interpreted this way: |
||||
- On 2010-01-01 I receive `1` before the market closes meaning that, if I trust the signal, the close price of day d+1 will increase. I should buy the stock before the market closes. |
||||
- On 2010-01-02 I receive `0` before the market closes meaning that, if I trust the signal, the close price of day d+1 will not increase. I should not buy the stock. |
||||
|
||||
3. Backtest the signal created in Question 2. Here are some assumptions made to backtest this signal: |
||||
- When, at date d, the signal equals 1 we buy 1$ of stock just before the market closes and we sell the stock just before the market closes the next day. |
||||
- When, at date d, the signal equals 0, we do not buy anything. |
||||
- The profit is not reinvested: when invested, the amount is always 1$. |
||||
- Fees are not considered |
||||
|
||||
**The expected output** is a **Series that gives for each day the return of the strategy. The return of the strategy is the PnL (Profit and Losses) divided by the invested amount**. The PnL for day d is: |
||||
`(money earned this day - money invested this day)` |
||||
|
||||
Let's take the example of a 20% return for an invested amount of 1$. The PnL is `(1.2 - 1) = 0.2`. We notice that the PnL when the signal is 1 equals the daily return. The PnL when the signal is 0 is 0. |
||||
By convention, we consider that the PnL of d is affected to day d and not d+1, even if the underlying return contains the information of d+1. |
||||
|
||||
**The usage of for loop is not allowed**. |
||||
|
||||
4. Compute the return of the strategy. The return of the strategy is defined as: `(Total earned - Total invested) / Total invested` |
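Questions 1 to 4 can be sketched end-to-end on a toy series (the prices here are made up; the real exercise uses the Apple adjusted close):

```python
import numpy as np
import pandas as pd

# Toy prices standing in for the Apple adjusted close
dates = pd.date_range('2010-01-01', periods=5, freq='D')
adj_close = pd.Series([10.0, 10.5, 10.5, 9.45, 9.9], index=dates)

# Q1: daily future return, Return(t) = (Price(t+1) - Price(t)) / Price(t)
future_return = (adj_close.shift(-1) - adj_close) / adj_close

# Q2: random long-only signal with p=0.5
rng = np.random.default_rng(0)
signal = pd.Series(rng.integers(0, 2, len(dates)), index=dates,
                   name='long_only_signal')

# Q3: daily PnL; 1$ invested when the signal is 1,
# so the daily PnL equals the future return on those days
pnl = signal * future_return

# Q4: return of the strategy = total PnL / total invested
strategy_return = pnl.sum() / signal.sum()
```

Everything is vectorized: the signal multiplies the future-return series element-wise, so no for loop is needed.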
||||
|
||||
5. Now the input signal is: **always buy**. Compute the daily PnL and the total PnL. Plot the daily PnL of Q5 and of Q3 on the same plot |
||||
|
||||
- https://www.investopedia.com/terms/b/backtesting.asp |
@ -0,0 +1,37 @@
|
||||
# W1D05 Piscine AI - Data Science |
||||
|
||||
## Time Series with Pandas |
||||
|
||||
Time series data are data that are indexed by a sequence of dates or times. Today, you'll learn how to use methods built into Pandas to work with this index. You'll also learn for instance: |
||||
- to resample time series to change the frequency |
||||
- to calculate rolling and cumulative values for times series |
||||
- to build a backtest |
||||
|
||||
Time series are used a lot in finance. You'll learn to evaluate financial strategies using Pandas. It is important to keep in mind that NumPy and Pandas operations are vectorized. That's why some questions constrain you not to use a for loop ;-). |
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercise 1 Series |
||||
- Exercise 2 Financial data |
||||
- Exercise 3 Multi asset returns |
||||
- Exercise 4 Backtest |
||||
|
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Jupyter or JupyterLab |
||||
|
||||
*Version of Pandas I used to do the exercises: 1.0.1*. |
||||
I suggest using the most recent one. |
||||
|
||||
## Resources |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/ |
||||
|
||||
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf |
||||
|
||||
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/ |
||||
|
||||
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe |
@ -0,0 +1,105 @@
|
||||
# RAID01 - Backtesting on the SP500 - audit |
||||
|
||||
### Preliminary |
||||
|
||||
###### Is the structure of the project as below ? |
||||
|
||||
``` |
||||
project |
||||
│ README.md |
||||
│ environment.yml |
||||
│ |
||||
└───data |
||||
│ │ sp500.csv |
||||
│ | prices.csv |
||||
│ |
||||
└───notebook |
||||
│ │ analysis.ipynb |
||||
| |
||||
|───scripts |
||||
| │ memory_reducer.py |
||||
| │ preprocessing.py |
||||
| │ create_signal.py |
||||
| | backtester.py |
||||
│ | main.py |
||||
│ |
||||
└───results |
||||
│ plots |
||||
│ results.txt |
||||
│ outliers.txt |
||||
|
||||
``` |
||||
|
||||
###### Does the readme file contain a description of the project, explain how to run the code from an empty environment, give a summary of the implementation of each python file and contain a conclusion that gives the performance of the strategy? |
||||
|
||||
|
||||
###### Does the environment contain all libraries used and their versions that are necessary to run the code ? |
||||
|
||||
###### Does the notebook contain a missing values analysis? **Example**: number of missing values per variables or per year |
||||
|
||||
###### Does the notebook contain an outliers analysis? |
||||
|
||||
###### Does the notebook contain a histogram of average price per company for all variables (with the plot saved in the images folder) ? This is required only for the `prices.csv` data. |
||||
|
||||
###### Does the notebook describe at least 5 outliers ('ticker', 'date', price) ? To check the outliers it is simple: Search the historical stock price on Google at the given date and compare. The price may fluctuate a bit. The goal here is not to match the historical price found on Google but to detect a huge difference between the price in our data and the real historical one. |
||||
|
||||
|
||||
Notes: |
||||
- For all questions always check the values are sorted by date. If not the answers are wrong. |
||||
- The plots are validated only if they contain a title |
||||
|
||||
## Python files |
||||
### 1. memory_reducer.py |
||||
|
||||
###### Does the `prices` data set weigh less than **8MB** (megabytes) ? |
||||
###### Does the `sp500` data set weigh less than **0.15MB** (megabytes) ? |
||||
###### Is the data type at least `np.float32` ? A smaller data type may alter the precision of the data. |
||||
|
||||
|
||||
### 2. preprocessing.py |
||||
|
||||
##### The data is aggregated on a monthly period and only the last element is kept |
||||
##### The outliers are filtered out by removing all prices greater than 10k$ or smaller than 0.1$ |
||||
##### The historical return is computed using only current and past values. |
||||
##### The future return is computed using only current and future values. (Reminder: as the data is resampled monthly, computing the return is straightforward) |
||||
##### The outliers in the returns data are set to NaN for all returns not in the years 2008 and 2009. The filters are: return > 1 and return < -0.5. |
||||
##### The missing values are filled using the last value available **for the company**. `df.fillna(method='ffill')` is wrong because the previous value can be the return or price of another company. |
||||
##### The missing values that can't be filled using the previous existing value are dropped. |
||||
##### The number of missing values is 0 |
||||
|
||||
Best practice: |
||||
|
||||
Do not fill the last values of the future return: they are missing because the data set ends at a given date, so forward filling doesn't make sense. It makes more sense to drop these rows because the backtest focuses on observed data. |
||||
|
||||
|
||||
### 3. create_signal.py |
||||
|
||||
##### The metric `average_return_1y` is added as a new column of the merged DataFrame. The metric is relative to a company: it is important to group the data by company before computing the average return over 1 year. It is accepted to consider that one year is 12 consecutive rows. |
||||
|
||||
##### The signal is added as a new column to the merged DataFrame. The signal, which is boolean, indicates whether, within the same month, the company is in the top 20: the 20 companies with the highest metric for that month. The highest metric gets rank 1 (if `rank` is used, the parameter `ascending` should be set to `False`). |
||||
|
||||
### 4. backtester.py |
||||
|
||||
##### The PnL is computed by multiplying the signal `Series` by the **future returns**. |
||||
|
||||
##### The return of the strategy is computed by dividing the PnL by the sum of the signal `Series`. |
||||
|
||||
##### The signal used on the SP500 is the `pd.Series([20,20,...,20])` |
||||
|
||||
##### The series used in the plot are the cumulative PnL. `cumsum` can be used |
||||
|
||||
##### The PnL on the full historical data is **smaller than 75$**. If not, it means that the outliers were not corrected correctly. |
||||
|
||||
###### Does the plot contain a title ? |
||||
###### Does the plot contain a legend ? |
||||
###### Does the plot contain a x-axis and y-axis name ? |
||||
|
||||
|
||||
![alt text][performance] |
||||
|
||||
[performance]: ../images/w1_weekend_plot_pnl.png "Cumulative Performance" |
||||
|
||||
|
||||
### 5. main.py |
||||
|
||||
###### Does the command `python main.py` execute the code from data imports to the backtest and save the results ? It shouldn't return any error to validate the project. |
@ -0,0 +1,155 @@
|
||||
# RAID01 - Backtesting on the SP500 |
||||
|
||||
## SP500 data preprocessing |
||||
|
||||
The goal of this project is to perform a backtest on the SP500 constituents. The SP500 is an index of the 500 largest capitalizations in the US. |
||||
|
||||
## Data |
||||
|
||||
The input files are `sp500.csv` and `stock_prices.csv`: |
||||
|
||||
- `sp500.csv` contains the SP500 data. The SP500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States. |
||||
|
||||
- `stock_prices.csv`: contains the close prices for all the companies that had been in the SP500. It contains a lot of missing data. The adjusted close price may be unavailable for three main reasons: |
||||
|
||||
- The company doesn't exist at date d |
||||
- The company is not publicly listed |
||||
- Its close price hasn't been reported |
||||
- Note: The quality of this data set is not good: some prices are wrong, there are price spikes, and there are price adjustments (share splits, dividend distributions). The price adjustments are corrected in the adjusted close, but I'm deliberately not providing it for this project, to let you understand what bad quality data is and how important it is to detect outliers and missing values. The idea is not to correct the full data set manually, but to correct the main problems. |
||||
|
||||
_Note: The corrections will not fix the data, as a result the results may be abnormal compared to results from cleaned financial data. That's not a problem for this small project !_ |
||||
|
||||
## Problem |
||||
|
||||
Once this data is preprocessed, it will be used to generate a signal: for each asset at each date, a metric that indicates if the asset price will increase the next month. At each date (once a month) we will take the 20 highest metrics and invest 1$ per company. This strategy is called **stock picking**: it consists in picking stocks in an index and trying to outperform the index. Finally we will compare the performance of our strategy to the benchmark: the SP500 |
||||
|
||||
It is important to understand that the SP500 components change over time. The reason is simple: Facebook entered the SP500 in 2013, which means that another company had to be removed from the 500. |
||||
|
||||
The structure of the project is: |
||||
|
||||
```console |
||||
project |
||||
│ README.md |
||||
│ environment.yml |
||||
│ |
||||
└───data |
||||
│ │ sp500.csv |
||||
│ | prices.csv |
||||
│ |
||||
└───notebook |
||||
│ │ analysis.ipynb |
||||
| |
||||
|───scripts |
||||
| │ memory_reducer.py |
||||
| │ preprocessing.py |
||||
| │ create_signal.py |
||||
| | backtester.py |
||||
│ | main.py |
||||
│ |
||||
└───results |
||||
│ plots |
||||
│ results.txt |
||||
│ outliers.txt |
||||
``` |
||||
|
||||
There are four parts: |
||||
|
||||
## 1. Preliminary |
||||
|
||||
- Create a function that takes one CSV data file as input. This function should optimize the data types to reduce the memory footprint and return a memory-optimized DataFrame. |
||||
- For `float` data the smallest data type used is `np.float32` |
||||
- These steps may help you to implement the memory_reducer: |
||||
|
||||
1. Iterate over every column |
||||
2. Determine if the column is numeric |
||||
3. Determine if the column can be represented by an integer |
||||
4. Find the min and the max value |
||||
5. Determine and apply the smallest datatype that can fit the range of values |
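The steps above could be sketched like this. This is a simplified, hypothetical version that operates on a DataFrame; the project's `memory_reducer` should additionally read the CSV file first:

```python
import numpy as np
import pandas as pd

def reduce_memory(df):
    """Sketch: downcast each numeric column to the smallest safe type."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            # smallest integer type that fits the min/max of the column
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(df[col]):
            # np.float32 is the smallest float type allowed in this project
            df[col] = df[col].astype(np.float32)
    return df

df = reduce_memory(pd.DataFrame({'qty': [1, 2, 300], 'price': [1.5, 2.5, 3.5]}))
```

`pd.to_numeric(..., downcast='integer')` performs steps 4 and 5 in one call: it inspects the min and max of the column and picks the smallest integer dtype that can represent the range.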
||||
|
||||
## 2. Data wrangling and preprocessing |
||||
|
||||
- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least: |
||||
|
||||
- Missing values analysis |
||||
- Outliers analysis (there are a lot of outliers) |
||||
- A histogram of the average price per company, for all variables (save the plots with the images). |
||||
- Describe at least 5 outliers ('ticker', 'date', 'price'). Put them in an `outliers.txt` file with the 3 fields, in the folder `results`. |
||||
|
||||
_Note: create functions that generate the plots and save them in the images folder. Add a parameter `plot` with a default value `False` which doesn't show the plot. This will be useful for the correction, to let people run your code without overriding your plots._ |
||||
|
||||
- Here is how the `prices` data should be preprocessed: |
||||
|
||||
- Resample data on month and keep the last value |
||||
- Filter prices outliers: Remove prices outside of the range 0.1$, 10k$ |
||||
- Compute monthly returns: |
||||
|
||||
- Historical returns. **returns(current month) = (price(current month) - price(previous month)) / price(previous month)** |
||||
- Future returns. **returns(current month) = (price(next month) - price(current month)) / price(current month)** |
||||
|
||||
- Replace the return outliers by the last value available for the same company. This corrects price spikes that correspond to a monthly return greater than 1 or smaller than -0.5. This correction should not consider the 2008 and 2009 period, as the financial crisis impacted the market brutally. **Don't forget that a value is considered an outlier compared to the other returns/prices of the same company** |
||||
|
||||
At this stage the DataFrame should look like this: |
||||
|
||||
| | Price | monthly_past_return | monthly_future_return | |
||||
| :--------------------------------------------------- | ------: | ------------------: | -------------------: | |
||||
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'A') | 36.7304 | nan | -0.00365297 | |
||||
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AA') | 25.9505 | nan | 0.101194 | |
||||
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'AAPL') | 1.00646 | nan | 0.452957 | |
||||
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABC') | 11.4383 | nan | -0.0528713 | |
||||
| (Timestamp('2000-12-31 00:00:00', freq='M'), 'ABT') | 38.7945 | nan | -0.07205 | |
||||
|
||||
- Fill the missing values using the last available value (same company) |
||||
- Drop the missing values that can't be filled |
||||
- Print `prices.isna().sum()` |
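For a single company, the resampling and return computations above can be sketched as follows (toy values; in the project these operations must be applied per company, e.g. within a `groupby` on the ticker):

```python
import pandas as pd

# Toy daily prices for one company spanning two months
price = pd.Series([10.0, 11, 12, 13, 14, 15, 16, 17, 18, 20],
                  index=pd.date_range('2000-01-25', periods=10, freq='D'))

# Resample on month and keep the last value
monthly = price.resample('M').last()

# Filter price outliers outside the range [0.1$, 10k$]
monthly = monthly[(monthly > 0.1) & (monthly < 10_000)]

# Historical return: (price(m) - price(m-1)) / price(m-1)
monthly_past_return = monthly.pct_change()

# Future return: (price(m+1) - price(m)) / price(m)
monthly_future_return = monthly.shift(-1) / monthly - 1
```

Because the data is resampled monthly, the future return of a month is simply the historical return of the next month, shifted by one row.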
||||
|
||||
- Here is how the `sp500.csv` data should be preprocessed: |
||||
|
||||
- Resample data on month and keep the last value |
||||
- Compute historical monthly returns on the adjusted close |
||||
|
||||
## 3. Create signal |
||||
|
||||
At this stage we have a data set with features that we will leverage to get an investment signal. As previously said, we will focus on a single variable to create the signal: **monthly_past_return**. The signal will be the average of the monthly returns of the previous year |
||||
|
||||
The naive assumption made here is that if a stock has performed well the last year it will perform well the next month. Moreover, we assume that we can buy stocks as soon as we have the signal (the signal is available at the close of day `d` and we assume that we can buy the stock at close of day `d`. The assumption is acceptable while considering monthly returns, because the difference between the close of day `d` and the open of day `d+1` is small comparing to the monthly return) |
||||
|
||||
- Create a column `average_return_1y` |
||||
- Create a column named `signal` that contains `True` if `average_return_1y` is among the 20 highest values of `average_return_1y` in the month. |
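The ranking step can be sketched on a toy frame. `TOP_N` is 20 in the project, but 1 here so the toy example stays small; the column names match the project, the data is made up:

```python
import pandas as pd

# Toy (Date, Ticker) frame with the average_return_1y metric already computed
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2001-01-31', '2001-02-28']), ['A', 'B', 'C']],
    names=['Date', 'Ticker'])
df = pd.DataFrame({'average_return_1y': [0.10, 0.05, 0.20,
                                         0.01, 0.30, 0.02]}, index=idx)

# Rank within each month; the highest metric gets rank 1 (ascending=False)
rank = df.groupby(level='Date')['average_return_1y'].rank(ascending=False)

# Signal: True when the company is in the month's top N
TOP_N = 1  # 20 in the real project
df['signal'] = rank <= TOP_N
```

Grouping by date before ranking is what makes the top N relative to each month rather than to the whole history.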
||||
|
||||
## 4. Backtester |
||||
|
||||
At this stage we have an investment signal that indicates each month what are the 20 companies we should invest 1$ on (1$ each). In order to check the strategies and performance we will backtest our investment signal. |
||||
|
||||
- Compute the PnL and the total return of our strategy without a for loop. Save the results in a text file `results.txt` in the folder `results`. |
||||
- Compute the PnL and the total return of the strategy that consists in investing 20$ each day on the SP500. Compare. Save the results in a text file `results.txt` in the folder `results`. |
||||
- Create a plot that shows the performance of the strategy over time for the SP500 and the Stock Picking 20 strategy. |
||||
|
||||
A data point (x-axis: date, y-axis: cumulated_return) is: the **cumulated returns** from the beginning of the strategy at date `t`. Save the plot in the results folder. |
||||
|
||||
> This plot is used a lot in finance because it helps to compare a custom strategy with an index. In that case we say that the SP500 is used as a **benchmark** for the Stock Picking strategy. |
||||
|
||||
![alt text][performance] |
||||
|
||||
[performance]: images/w1_weekend_plot_pnl.png 'Cumulative Performance' |
||||
|
||||
## 5. Main |
||||
|
||||
Here is a sketch of `main.py`. |
||||
|
||||
```python |
||||
# main.py |
||||
|
||||
# import data |
||||
prices, sp500 = memory_reducer(paths) |
||||
|
||||
# preprocessing |
||||
prices, sp500 = preprocessing(prices, sp500) |
||||
|
||||
# create signal |
||||
prices = create_signal(prices) |
||||
|
||||
#backtest |
||||
backtest(prices, sp500) |
||||
``` |
||||
|
||||
**The command `python main.py` executes the code from data imports to the backtest and saves the results.** |
@ -0,0 +1,9 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### Activate the virtual environment. If you used `conda` run `conda activate your_env` |
||||
|
||||
##### Run `python --version` |
||||
|
||||
###### Does it print `Python 3.x`? x >= 8 |
||||
|
||||
##### Does `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error ? |
@ -0,0 +1,81 @@
|
||||
# W2D01 Piscine AI - Data Science |
||||
|
||||
![Alt Text](w2_day01_linear_regression_video.gif) |
||||
|
||||
## Linear regression with Scikit Learn |
||||
|
||||
The goal of this day is to understand practical Linear regression and supervised learning. |
||||
|
||||
The word "regression" was introduced by Sir Francis Galton (a cousin of C. Darwin) when he |
||||
studied the size of individuals within a progeny. He was trying to understand why |
||||
large individuals in a population appeared to have smaller children, closer |
||||
to the average population size; hence the introduction of the term "regression". |
||||
|
||||
Today we will learn a basic algorithm used in **supervised learning** : **The Linear Regression**. We will be using **Scikit-learn** which is a machine learning library. It is designed to interoperate with the Python libraries NumPy and Pandas. |
||||
|
||||
We will also learn progressively the Machine Learning methodology for supervised learning - today we will focus on evaluating a machine learning model by splitting the data set in a train set and a test set. |
||||
|
||||
## Exercises of the day |
||||
|
||||
- Exercise 0 Environment and libraries |
||||
- Exercise 1 Scikit-learn estimator |
||||
- Exercise 2 Linear regression in 1D |
||||
- Exercise 3 Train test split |
||||
- Exercise 4 Forecast diabetes progression |
||||
- Bonus: Exercise 5 Gradient Descent - **Optional** |
||||
|
||||
|
||||
## Virtual Environment |
||||
- Python 3.x |
||||
- NumPy |
||||
- Pandas |
||||
- Matplotlib |
||||
- Scikit Learn |
||||
- Jupyter or JupyterLab |
||||
|
||||
*Version of Scikit Learn I used to do the exercises: 0.22*. I suggest using the most recent one. Scikit Learn 1.0 is finally available after ... 14 years. |
||||
|
||||
## Resources |
||||
|
||||
### To start with Scikit-learn |
||||
|
||||
- https://scikit-learn.org/stable/tutorial/basic/tutorial.html |
||||
|
||||
- https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html |
||||
|
||||
- https://scikit-learn.org/stable/modules/linear_model.html |
||||
|
||||
### Machine learning methodology and algorithms |
||||
|
||||
- This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Andrew Ng is a star in the Machine Learning community. I recommend spending some time during the projects to focus on some algorithms. However, Python is not the language used for the course. https://www.coursera.org/learn/machine-learning |
||||
|
||||
- https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet |
||||
|
||||
- https://scikit-learn.org/stable/tutorial/index.html |
||||
|
||||
### Linear Regression |
||||
|
||||
- https://towardsdatascience.com/laymans-introduction-to-linear-regression-8b334a3dab09 |
||||
|
||||
- https://towardsdatascience.com/linear-regression-the-actually-complete-introduction-67152323fcf2 |
||||
|
||||
### Train test split |
||||
|
||||
- https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/ |
||||
|
||||
- https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture?hl=en |
||||
|
||||
|
||||
# Exercise 0 Environment and libraries |
||||
|
||||
The goal of this exercise is to set up the Python work environment with the required libraries. |
||||
|
||||
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries. |
||||
|
||||
I recommend using: |
||||
 |
||||
- the **latest stable version** of Python. |
||||
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. |
||||
- one of the most recent versions of the required libraries. |
||||
|
||||
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`. |
@ -0,0 +1,22 @@
|
||||
##### The exercise is validated if all questions of the exercise are validated |
||||
|
||||
##### The question 1 is validated if the output of the fitted model is: |
||||
|
||||
```python |
||||
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, |
||||
                 normalize=False) |
||||
``` |
||||
|
||||
##### The question 2 is validated if the output is: |
||||
|
||||
```python |
||||
array([[3.96013289]]) |
||||
``` |
||||
|
||||
##### The question 3 is validated if the output is: |
||||
|
||||
```output |
||||
Coefficients: [[0.99667774]] |
||||
Intercept: [-0.02657807] |
||||
Score: 0.9966777408637874 |
||||
``` |
@ -0,0 +1,13 @@
|
||||
# Exercise 1 Scikit-learn estimator |
||||
|
||||
The goal of this exercise is to learn to fit a Scikit-learn estimator and use it to predict. |
||||
|
||||
```console |
||||
X, y = [[1],[2.1],[3]], [[1],[2],[3]] |
||||
``` |
||||
|
||||
1. Fit a LinearRegression from Scikit-learn with X the features and y the target. |
||||
|
||||
2. Predict for `x_pred = [[4]]` |
||||
|
||||
3. Print the coefficients (`coef_`) and the intercept (`intercept_`), and the score (`score`) of the regression of X and y. |
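A sketch of the whole exercise (the exact numeric values depend on the Scikit-learn version, but should be very close to the expected outputs):

```python
from sklearn.linear_model import LinearRegression

X, y = [[1], [2.1], [3]], [[1], [2], [3]]

# 1. Fit the estimator
model = LinearRegression()
model.fit(X, y)

# 2. Predict for a new point
prediction = model.predict([[4]])

# 3. Coefficients, intercept and score (R^2) of the regression
coefficients = model.coef_
intercept = model.intercept_
score = model.score(X, y)
```

Because `y` is a list of one-element lists, `coef_` and the prediction are 2D arrays; with a flat `y` they would be 1D.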