
fix: update audits to new format

pull/42/head
Badr Ghazlane 2 years ago
parent
commit
2394988fb7
  1. one_exercise_per_file/week01/day01/ex01/audit/readme.md (6)
  2. one_exercise_per_file/week01/day01/ex02/audit/readme.md (6)
  3. one_exercise_per_file/week01/day01/ex03/audit/readme.md (11)
  4. one_exercise_per_file/week01/day01/ex04/audit/readme.md (12)
  5. one_exercise_per_file/week01/day01/ex05/audit/readme.md (10)
  6. one_exercise_per_file/week01/day01/ex06/audit/readme.md (8)
  7. one_exercise_per_file/week01/day02/ex01/audit/readme.md (18)
  8. one_exercise_per_file/week01/day02/ex02/audit/readme.md (40)
  9. one_exercise_per_file/week01/day02/ex03/audit/readme.md (47)
  10. one_exercise_per_file/week01/day02/ex04/audit/readme.md (23)
  11. one_exercise_per_file/week01/day03/ex01/audit/readme.md (5)
  12. one_exercise_per_file/week01/day03/ex02/audit/readme.md (6)
  13. one_exercise_per_file/week01/day03/ex03/audit/readme.md (13)
  14. one_exercise_per_file/week01/day03/ex04/audit/readme.md (13)
  15. one_exercise_per_file/week01/day03/ex05/audit/readme.md (17)
  16. one_exercise_per_file/week01/day03/ex06/audit/readme.md (22)
  17. one_exercise_per_file/week01/day03/ex07/audit/readme.md (8)
  18. one_exercise_per_file/week01/day04/ex01/audit/readme.md (4)
  19. one_exercise_per_file/week01/day04/ex02/audit/readme.md (6)
  20. one_exercise_per_file/week01/day04/ex03/audit/readme.md (9)
  21. one_exercise_per_file/week01/day04/ex04/audit/readme.md (39)
  22. one_exercise_per_file/week01/day04/ex05/audit/readme.md (7)
  23. one_exercise_per_file/week01/day04/ex06/audit/readme.md (7)
  24. one_exercise_per_file/week01/day05/ex01/audit/readme.md (12)
  25. one_exercise_per_file/week01/day05/ex02/audit/readme.md (23)
  26. one_exercise_per_file/week01/day05/ex03/audit/readme.md (3)
  27. one_exercise_per_file/week01/day05/ex04/audit/readme.md (29)

6
one_exercise_per_file/week01/day01/ex01/audit/readme.md

@ -1,5 +1,7 @@
1. This exercise is validated if the your_numpy_array is a NumPy array. It can be checked with `type(your_numpy_array)` that should be equal to `numpy.ndarray`. And if the type of is element are as follow.
##### This exercise is validated if `your_numpy_array` is a NumPy array. It can be checked with `type(your_numpy_array)`, which should be equal to `numpy.ndarray`, and if the types of its elements are as follows.
##### Try and run the following code.
```python
for i in your_np_array:
@ -14,3 +16,5 @@ for i in your_np_array:
<class 'set'>
<class 'bool'>
```
###### Does it display the right types as above?
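As a hedged sketch of that check (the contents below are illustrative; the exercise's real array may hold other values, but the element types are what the audit verifies):

```python
import numpy as np

# Hypothetical mixed-type contents; only the element types matter here.
items = [1, 2.5, "text", (1, 2), [1, 2], {1, 2}, True]
your_numpy_array = np.empty(len(items), dtype=object)
for i, item in enumerate(items):
    your_numpy_array[i] = item  # object dtype keeps each Python type intact

assert type(your_numpy_array) == np.ndarray
for element in your_numpy_array:
    print(type(element))
```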

6
one_exercise_per_file/week01/day01/ex02/audit/readme.md

@ -1,3 +1,3 @@
1. The question is validated is the solution uses `np.zeros` and if the shape of the array is `(300,)`
2. The question is validated if the solution uses `reshape` and the shape of the array is `(3, 100)`
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the solution uses `np.zeros` and if the shape of the array is `(300,)`
##### The question 2 is validated if the solution uses `reshape` and the shape of the array is `(3, 100)`
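A minimal sketch of the two accepted calls (variable names are illustrative):

```python
import numpy as np

zeros = np.zeros(300)             # question 1: 300 zeros, shape (300,)
reshaped = zeros.reshape(3, 100)  # question 2: same data, shape (3, 100)
```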

11
one_exercise_per_file/week01/day01/ex03/audit/readme.md

@ -1,10 +1,13 @@
1. This question is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100 and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`.
##### The exercise is validated if all questions of the exercise are validated
2. This question is validated if the solution is: `integers[::2]`
3. This question is validated if the solution is: `integers[::-2]`
##### The question 1 is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100 and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`.
4. This question is validated if the array is: `np.array([0, 1,0,3,4,0,...,0,99,100])`. There are at least two ways to get this results without for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array:
##### The question 2 is validated if the solution is: `integers[::2]`
##### The question 3 is validated if the solution is: `integers[::-2]`
##### The question 4 is validated if the array is: `np.array([0, 1,0,3,4,0,...,0,99,100])`. There are at least two ways to get this result without a for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array:
```python
mask = (integers+1)%3 == 0
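# Hedged, self-contained continuation (assumes integers = np.arange(1, 101)):
import numpy as np
integers = np.arange(1, 101)
mask = (integers + 1) % 3 == 0
integers[mask] = 0          # second approach: boolean-mask indexing
check = np.arange(1, 101)
check[1::3] = 0             # first approach: slice assignment
assert (integers == check).all()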

12
one_exercise_per_file/week01/day01/ex04/audit/readme.md

@ -1,10 +1,12 @@
For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
##### The exercise is validated if all questions of the exercise are validated
1. The solution is accepted if the solution is: `np.random.seed(888)`
##### For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
2. The solution is accepted if the solution is `np.random.randn(100)`. The value of the first element is `0.17620087373662233`.
##### The question 1 is validated if the solution is: `np.random.seed(888)`
3. The solution is accepted if the solution is `np.random.randint(1,11,(8,8))`.
##### The question 2 is validated if the solution is: `np.random.randn(100)`. The value of the first element is `0.17620087373662233`.
##### The question 3 is validated if the solution is: `np.random.randint(1,11,(8,8))`.
```console
Given the NumPy version and the seed, you should have this output:
@ -19,7 +21,7 @@ For this exercise, as the results may change depending on the version of the pac
[ 4, 4, 9, 2, 8, 5, 9, 5]])
```
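All these checks hinge on seeding; a minimal sketch of the seeded draws the audit describes (the exact printed values additionally depend on the NumPy version):

```python
import numpy as np

np.random.seed(888)                          # question 1: fix the seed
normals = np.random.randn(100)               # question 2: 100 standard-normal draws
grid = np.random.randint(1, 11, (8, 8))      # question 3: integers in [1, 10]
cube = np.random.randint(1, 18, (4, 2, 5))   # question 4: integers in [1, 17]

# Re-seeding reproduces exactly the same draws.
np.random.seed(888)
assert np.array_equal(normals, np.random.randn(100))
```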
4. The solution is accepted if the solution is `np.random.randint(1,18,(4,2,5))`.
##### The question 4 is validated if the solution is: `np.random.randint(1, 18, (4, 2, 5))`.
```console
Given the NumPy version and the seed, you should have this output:

10
one_exercise_per_file/week01/day01/ex05/audit/readme.md

@ -1,10 +1,12 @@
1. This question is validated if the generated array is based on an iterator as `range` or `np.arange`. Check that 50 is part of the array.
##### The exercise is validated if all questions of the exercise are validated
2. This question is validated if the generated array is based on an iterator as `range` or `np.arange`. Check that 100 is part of the array.
##### The question 1 is validated if the generated array is based on an iterator as `range` or `np.arange`. Check that 50 is part of the array.
3. This question is validated if you concatenated this way `np.concatenate(array1,array2)`.
##### The question 2 is validated if the generated array is based on an iterator as `range` or `np.arange`. Check that 100 is part of the array.
4. This question is validated if the result is:
##### The question 3 is validated if you concatenated this way: `np.concatenate((array1, array2))`
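A hedged sketch of questions 1-3 (the array contents are assumptions; note that `np.concatenate` takes the arrays as a single sequence argument):

```python
import numpy as np

array1 = np.arange(1, 51)    # assumed contents: the audit only checks that 50 is in it
array2 = np.arange(51, 101)  # assumed contents: 100 must be in it
combined = np.concatenate((array1, array2))  # the arrays go in a tuple
```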
##### The question 4 is validated if the result is:
```console
array([[ 1, ... , 10],

8
one_exercise_per_file/week01/day01/ex06/audit/readme.md

@ -1,7 +1,9 @@
1. The question is validated if the output is the same as:
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is the same as:
`np.ones([9,9], dtype=np.int8)`
2. The question is validated if the output is
##### The question 2 is validated if the output is
```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
@ -15,7 +17,7 @@
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
```
The solution is not accepted if the values of the array have been changed one by one manually. The usage of the for loop is not allowed neither.
##### The solution of question 2 is not accepted if the values of the array have been changed one by one manually. The usage of a for loop is not allowed either.
Here is an example of a possible solution:
```console

18
one_exercise_per_file/week01/day02/ex01/audit/readme.md

@ -1,6 +1,8 @@
1. The solution is accepted if the DataFrame created is the same as the "model" DataFrame. Check that the index is not 1,2,3,4,5.
##### The exercise is validated if all questions of the exercise are validated
2. The solution is accepted if the types you get for the columns are
##### The solution of question 1 is accepted if the DataFrame created is the same as the "model" DataFrame. Check that the index is not 1,2,3,4,5.
##### The solution of question 2 is accepted if the types you get for the columns are as below and if the types of the first value of the columns are as below
```console
<class 'pandas.core.series.Series'>
@ -8,10 +10,8 @@
<class 'pandas.core.series.Series'>
```
and if the types of the first value of the columns are
```console
<class 'str'>
<class 'list'>
<class 'float'>
```
```console
<class 'str'>
<class 'list'>
<class 'float'>
```
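A hedged sketch of a DataFrame matching those type checks (the columns here are hypothetical; the exercise's "model" DataFrame differs):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b"], "tags": [[1, 2], [3]], "score": [0.5, 1.5]},
    index=[10, 20],  # deliberately not 1, 2, 3, 4, 5
)
for col in df.columns:
    print(type(df[col]))          # <class 'pandas.core.series.Series'>
    print(type(df[col].iloc[0]))  # element types: str, list, float (np.float64)
```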

40
one_exercise_per_file/week01/day02/ex02/audit/readme.md

@ -1,6 +1,8 @@
1. `del` works but it is not a solution I recommend. For this exercise it is accepted. It is expected to use `drop` with `axis=1`. `inplace=True` may be useful to avoid to affect the result to a variable.
##### The exercise is validated if all questions of the exercise are validated
2. The preferred solution is `set_index` with `inplace=True`. As long as the DataFrame returns the output below, the solution is accepted. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted.
##### The solution of question 1 is accepted if you use `drop` with `axis=1`. `inplace=True` may be useful to avoid assigning the result to a variable. A solution that could be accepted too (even if it's not a solution I recommend) is `del`.
##### The solution of question 2 is accepted if the DataFrame returns the output below. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted. I recommend using `set_index` with `inplace=True` to do so.
```python
Input: df.head().index
@ -11,7 +13,7 @@
dtype='datetime64[ns]', name='Date', freq=None)
```
3. The preferred solution is `pd.to_numeric` with `coerce=True`. The solution is accepted if all types are `float64`.
##### The solution of question 3 is accepted if all the types are `float64` as below. The preferred solution is `pd.to_numeric` with `errors='coerce'`.
```python
Input: df.dtypes
@ -27,18 +29,26 @@
```
4. `df.describe()` is expected
##### The solution of question 4 is accepted if you use `df.describe()`.
##### The solution of question 5 is accepted if you used `dropna` and have the number of missing values equal to 0. You should have noticed that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` allows checking the number of missing values and `df.dropna()` with `inplace=True` allows removing the rows with missing values.
##### The solution of question 6 is accepted if one of the two approaches below were used:
```python
#solution 1
df.loc[:,'A'] = (df['A'] + 1) * 0.06
5. You should have noticed that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` allows to check the number of missing values and `df.dropna()` with `inplace=True` allows to remove the rows with missing values. The solution is accepted if you used `dropna` and have the number of missing values as 0.
#solution 2
df.loc[:,'A'] = df.loc[:,'A'].apply(lambda x: (x+1)*0.06)
```
6. Two solutions are accepted:
- `df.loc[:,'A'] = (df['A'] + 1) * 0.06`
- Using `apply`: `df.loc[:,'A'] = df.loc[:,'A'].apply(lambda x: (x+1)*0.06)`
You may wonder `df.loc[:,'A']` is required and if `df['A'] = ...` works too. The answer is no. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you are affecting a value to a **copy** of the DataFrame and not in the DataFrame.
You may wonder why `df.loc[:,'A']` is required and if `df['A'] = ...` works too. **The answer is no**. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you are assigning a value to a **copy** of the DataFrame and not to the DataFrame itself.
More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
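Both accepted solutions, sketched on hypothetical data, using the `.loc[:,'A']` form that avoids writing into a copy:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})  # hypothetical column
df2 = df.copy()

df.loc[:, "A"] = (df["A"] + 1) * 0.06                               # solution 1: vectorised
df2.loc[:, "A"] = df2.loc[:, "A"].apply(lambda x: (x + 1) * 0.06)   # solution 2: apply

assert df["A"].equals(df2["A"])  # both approaches give the same column
```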
7. The solution is accepted as long as the output of `print(filtered_df.head().to_markdown())` is:
##### The solution of question 7 is accepted as long as the output of `print(filtered_df.head().to_markdown())` is as below and if the number of rows is equal to **449667**.
| Date | Global_active_power | Global_reactive_power |
|:--------------------|----------------------:|------------------------:|
@ -48,9 +58,7 @@
| 2008-12-27 00:00:00 | 1.07 | 0.174 |
| 2008-12-27 00:00:00 | 0.804 | 0.184 |
Check that the number of rows is equal to **449667**.
8. The solution is accepted if output is
##### The solution of question 8 is accepted if the output is
```console
Global_active_power 0.254
@ -62,9 +70,9 @@
```
9. The solution is accepted if the output is `Timestamp('2009-02-22 00:00:00')`
##### The solution of question 9 is accepted if the output is `Timestamp('2009-02-22 00:00:00')`
10. The solution is accepted if the output for `print(sorted_df.tail().to_markdown())` is
##### The solution of question 10 is accepted if the output of `print(sorted_df.tail().to_markdown())` is
| Date | Global_active_power | Global_reactive_power | Voltage |
|:--------------------|----------------------:|------------------------:|----------:|
@ -74,7 +82,7 @@
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.64 |
| 2008-12-08 00:00:00 | 0.076 | 0 | 236.5 |
11. The solution is based on `groupby` which creates groups based on the index `Date` and aggregates the groups using the `mean`. The solution is accepted if the output is
##### The solution of question 11 is accepted if the output is as below. The solution is based on `groupby` which creates groups based on the index `Date` and aggregates the groups using the `mean`.
```console
Date

47
one_exercise_per_file/week01/day02/ex03/audit/readme.md

@ -1,32 +1,23 @@
To validate this exercise all answers should return the expected numerical value given in the correction AND uses Pandas. For example using NumPy to compute the mean doesn't respect the philosophy of the exercise which is to use Pandas.
##### The exercise is validated if all questions of the exercise are validated
1. How many rows and columns are there?**10000 entries** and **14 columns**
##### To validate this exercise all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas.
There many solutions based on: shape, info, describe
##### The solution of question 1 is accepted if it contains **10000 entries** and **14 columns**. There are many solutions based on: `shape`, `info`, `describe`.
2. What is the average Purchase Price? **50.34730200000025**
##### The solution of question 2 is accepted if the answer is **50.34730200000025**.
Even if `np.mean` gives the solution, `df['Purchase Price'].mean()` is preferred
3. What were the highest and lowest purchase prices?
##### The solution of question 3 is accepted if the min is `0` and the max is `99.989999999999995`
min: 0
max: 99.989999999999995
##### The solution of question 4 is accepted if the answer is **1098**
4. How many people have English `'en'` as their Language of choice on the website? **1098**
##### The solution of question 5 is accepted if the answer is **30**
5. How many people have the job title of `"Lawyer"` ? **30**
##### The solution of question 6 is accepted if there are `4932` people that made the purchase during the `AM` and `5068` people that made the purchase during the `PM`. There are many ways to get the solution, but the goal of this question was to make you use `value_counts`
6. How many people made the purchase during the `AM` and how many people made the purchase during `PM` ?
PM: 5068
AM: 4932
There many ways to the solution but the goal of this question was to make you use `value_counts`
7. What are the 5 most common Job Titles?
##### The solution of question 7 is accepted if the answer is as below. There are many ways to get the solution, but the goal of this question was to make you use `value_counts`
Interior and spatial designer 31
@ -38,25 +29,21 @@ To validate this exercise all answers should return the expected numerical value
Designer, jewellery 27
There many ways to the solution but the goal of this question was to make you use `value_counts`
##### The solution of question 8 is accepted if the purchase price is **75.1**
8. Someone made a purchase that came from Lot: `"90 WT"` , what was the Purchase Price for this transaction? **75.1**
9. What is the email of the person with the following Credit Card Number: `4926535242672853`. **bondellen@williams-garza.com**
10. How many people have American Express as their Credit Card Provider and made a purchase above `$95` ? **39**
##### The solution of question 9 is accepted if the email address is **bondellen@williams-garza.com**
The prefered solution is based on this:
##### The solution of question 10 is accepted if the answer is **39**. The preferred solution is based on this: `df[(df['A'] == X) & (df['B'] > Y)]`
`df[(df['A'] == X) & (df['B'] > Y)]`
11. How many people have a credit card that expires in `2025`? **1033**
The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the expiration date.
##### The solution of question 11 is accepted if the answer is **1033**. The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the expiration date.
12. What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)
##### The solution of question 12 is accepted if the answer is as below. The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
- hotmail.com 1638
- yahoo.com 1616
- gmail.com 1605
- smith.com 42
- williams.com 37
The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
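A hedged sketch of the `split`/`value_counts` approach on made-up emails (the real counts come from the purchases dataset):

```python
import pandas as pd

# Hypothetical emails; only the technique matters here.
df = pd.DataFrame({"Email": ["a@gmail.com", "b@gmail.com", "c@yahoo.com", "d@hotmail.com"]})
hosts = df["Email"].apply(lambda s: s.split("@")[1]).value_counts()
print(hosts.head(5))  # most common hosts first
```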

23
one_exercise_per_file/week01/day02/ex04/audit/readme.md

@ -1,14 +1,6 @@
1. This question is validated if you have done these two steps in that order:
##### The exercise is validated if all questions of the exercise are validated (except the bonus question)
- Convert the numerical columns to `float`
example:
```python
pd.to_numeric(df.loc[:,col], errors='coerce')
```
- Fill the missing values. There are many solutions for this step, here is one of them.
##### The solution of question 1 is accepted if you have done these two steps in that order. First, convert the numerical columns to `float` and then fill the missing values. The first step may involve `pd.to_numeric(df.loc[:,col], errors='coerce')`. The second step is validated if you eliminated all missing values. However there are many possibilities to fill the missing values. Here is one of them:
example:
@ -19,12 +11,10 @@
4:0})
```
##### The solution of question 2 is accepted if the solution is `df.loc[:,col].fillna(df[col].median())`.
2. This question is validated if the solution is: `df.loc[:,col].fillna(df[col].median())`
**Bonus questions**:
##### The solution of the bonus question is accepted if you find out the following: once we filled the missing values as suggested in the first question, `df.describe()` returns this interesting summary. We notice that the mean is way higher than the median, which means there are probably outliers in the data. The 75% quantile and the max confirm that: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet you realise this small flower can't be that big. The outliers have a major impact on the mean, which equals 56.9. Using this value to fill the missing values is not correct since it doesn't correspond to the real size of this flower. That is why, in that case, the best strategy to fill the missing values is the median. The truth is that I modified the data set! But real data sets ALWAYS contain outliers. Always think about the meaning of the data transformation! If you fill the missing values with zero, it means that you consider that the length or width of some flowers may be 0. It doesn't make sense.
- It is important to understand why filling the missing values with 0 or the mean of the column is a bad idea.
| | sepal_length | sepal_width | petal_length | petal_width |
|:------|---------------:|--------------:|---------------:|--------------:|
@ -37,9 +27,6 @@
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 6900 | 3809 | 1400 | 1600 |
Once we filled the missing values as suggested in the first question, `df.describe()` returns this interesting summary. We notice that the mean is way higher than the median. It means that there are maybe some outliers in the data. The quantile 75 and the max confirm that: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet you realise this small flower can't be that big. The outliers have a major impact on the mean which equals to 56.9. Filling this value for the missing value is not correct since it doesn't correspond to the real size of this flower. That is why in that case the best strategy to fill the missing values is the median. The truth is that I modified the data set ! But real data sets ALWAYS contains outliers.
Always think about the meaning of the data transformation ! If you fill the missing values by zero, it means that you consider that the length or width of some flowers may be 0. It doesn't make sense.
- If you noticed that there are some negative values and the huge values, you will be a good data scientist. **YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA**. Print the row with index 122 ;-)
This week, we will have the opportunity to focus on the data pre-processing to understand how the outliers are handled.
##### The solution of the bonus question is accepted if you noticed the negative values and the huge values. **YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA**. Print the row with index 122 ;-) This week, we will have the opportunity to focus on data pre-processing to understand how the outliers can be handled.
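A hedged sketch of the two steps for question 1 and the median fill for question 2, on a hypothetical column (the exercise applies this to every numerical column of the iris data):

```python
import pandas as pd

# Hypothetical column mixing a non-numeric entry and a missing value.
df = pd.DataFrame({"sepal_length": ["5.1", "4.9", "oops", None]})
col = "sepal_length"

df[col] = pd.to_numeric(df[col], errors="coerce")   # step 1: "oops" becomes NaN
df[col] = df[col].fillna(df[col].median())          # question 2: median fill

assert df[col].isna().sum() == 0
```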

5
one_exercise_per_file/week01/day03/ex01/audit/readme.md

@ -1,5 +1,8 @@
1. This question is validated if the plot reproduces the plot in the image. It has to contain a title, an x-axis name and a legend.
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria.
###### Does it have a title?
###### Does it have a name on the x-axis?
###### Does it have a legend?
![alt text][logo]
[logo]: ../w1day03_ex1_plot1.png "Bar plot ex1"

6
one_exercise_per_file/week01/day03/ex02/audit/readme.md

@ -1,5 +1,7 @@
1. This question is validated if the plot reproduces the plot in the image. It has to contain a title, an x-axis name and an y-axis name.
You should also observe that the older people are, the the more children they have.
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria. You should also observe that the older people are, the more children they have.
###### Does it have a title?
###### Does it have a name on the x-axis and y-axis?
![alt text][logo_ex2]

13
one_exercise_per_file/week01/day03/ex03/audit/readme.md

@ -1,11 +1,10 @@
1. This question is validated if the plot reproduces the plot in the image and respect those criteria
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria.
- the title
- name on x-axis and y-axis
- x-axis and y-axis are limited to [1,8]
- **style**:
- red dashdot line with a width of 3
- blue circles with a size of 12
###### Does it have a title?
###### Does it have a name on the x-axis and y-axis?
###### Are the x-axis and y-axis limited to [1,8]?
###### Is the line a red dashdot line with a width of 3?
###### Are the circles blue with a size of 12?
![alt text][logo_ex3]
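A hedged matplotlib sketch of those style criteria (the data, title, and axis names below are placeholders; the real ones come from the image):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(1, 7, 10)
y = x  # placeholder data
fig, ax = plt.subplots()
line, = ax.plot(x, y, "r-.", linewidth=3)   # red dashdot line, width 3
ax.plot(x, y, "bo", markersize=12)          # blue circles, size 12
ax.set_xlim(1, 8)
ax.set_ylim(1, 8)
ax.set_title("Placeholder title")           # the expected title is in the image
ax.set_xlabel("x")
ax.set_ylabel("y")
```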

13
one_exercise_per_file/week01/day03/ex04/audit/readme.md

@ -1,12 +1,9 @@
1. This question is validated if the plot reproduces the plot in the image and respect those criteria
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria.
The plot has to contain:
- the title
- name on left y-axis and right y-axis
- **style**:
- left data in black
- right data in red
###### Does it have a title?
###### Does it have a name on the left and right y-axes?
###### Is the left data black?
###### Is the right data red?
![alt text][logo_ex4]

17
one_exercise_per_file/week01/day03/ex05/audit/readme.md

@ -1,16 +1,11 @@
1. The question is validated if the plot reproduces the image and the given criteria:
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria.
The plot has to contain:
- 6 subplots: 2 rows, 3 columns
- Keep space between plots: `hspace=0.5` and `wspace=0.5`
- Each plot contains
- Text (2,3,i) centered at 0.5, 0.5. *Hint*: check the parameter `ha` of `text`
- a title: Title i
###### Does it contain 6 subplots (2 rows, 3 columns)?
###### Does it have space between plots (`hspace=0.5` and `wspace=0.5`)?
###### Do all subplots contain a title: `Title i`?
###### Do all subplots contain a text `(2,3,i)` centered at `(0.5, 0.5)`? *Hint*: check the parameter `ha` of `text`
###### Have all subplots been created in a for loop?
![alt text][logo_ex5]
[logo_ex5]: ../w1day03_ex5_plot1.png "Subplots ex5"
Check that the plot has been created with a for loop.

22
one_exercise_per_file/week01/day03/ex06/audit/readme.md

@ -1,22 +1,24 @@
1. This question is validated if the plot is in the image is reproduced using Plotly express given those criteria:
##### The exercise is validated if all questions of the exercise are validated
The plot has to contain:
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria.
###### Does it have a title?
###### Does it have a name on the x-axis and y-axis?
- a title
- x-axis name
- yaxis name
![alt text][logo_ex6]
[logo_ex6]: ../w1day03_ex6_plot1.png "Time series ex6"
2.This question is validated if the plot is in the image is reproduced using `plotly.graph_objects` given those criteria:
The plot has to contain:
##### The solution of question 2 is accepted if the plot reproduces the plot in the image using `plotly.graph_objects` and respects the following criteria.
- a title
- x-axis name
- yaxis name
###### Does it have a title?
###### Does it have a name on the x-axis and y-axis?
![alt text][logo_ex6]

8
one_exercise_per_file/week01/day03/ex07/audit/readme.md

@ -1,9 +1,7 @@
1. This question is validated if the plot is in the image is reproduced given those criteria:
##### The solution of question 1 is accepted if the plot reproduces the plot in the image and respects the following criteria. The code below shows a solution.
The plot has to contain:
- the title
- the legend
###### Does it have a title?
###### Does it have a legend?
![alt text][logo_ex7]

4
one_exercise_per_file/week01/day04/ex01/audit/readme.md

@ -1,6 +1,4 @@
## Correction
1. This question is validated if the outputted DataFrame is:
##### This question is validated if the outputted DataFrame is:
| | letter | number |
|---:|:---------|---------:|

6
one_exercise_per_file/week01/day04/ex02/audit/readme.md

@ -1,11 +1,13 @@
1. This question is validated if the output is:
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is:
| | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
|---:|-----:|:-------------|:-------------|:-------------|:-------------|
| 0 | 1 | A | B | K | L |
| 1 | 2 | C | D | M | N |
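A sketch reconstructed from the two tables (default merge suffixes for question 1, explicit `suffixes` for question 2):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "Feature1": ["A", "C"], "Feature2": ["B", "D"]})
df2 = pd.DataFrame({"id": [1, 2], "Feature1": ["K", "M"], "Feature2": ["L", "N"]})

merged = df1.merge(df2, on="id")                              # question 1: _x / _y suffixes
renamed = df1.merge(df2, on="id", suffixes=("_df1", "_df2"))  # question 2: custom suffixes
```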
2. This question is validated if the output is:
##### The question 2 is validated if the output is:
| | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
|---:|-----:|:---------------|:---------------|:---------------|:---------------|

9
one_exercise_per_file/week01/day04/ex03/audit/readme.md

@ -1,4 +1,6 @@
1. This question is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a similar table:
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a table as below. One of the answers that returns the correct DataFrame is `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`
| | Open | Close | Close_Adjusted | Twitter | Reddit |
|:-----------------------------------------------------|-----------:|----------:|-----------------:|------------:|----------:|
@ -8,8 +10,5 @@
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 |
| (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 |
One of the answers that returns the correct DataFrame is:
`market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`
2. This question is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
##### The question 2 is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
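A hedged sketch of both checks on tiny hypothetical frames (the exercise joins market and alternative data on a `(Date, Ticker)` index):

```python
import pandas as pd

market_data = pd.DataFrame({"Open": [1.0, 2.0]}, index=["a", "b"])   # hypothetical
alternative_data = pd.DataFrame({"Twitter": [0.5]}, index=["a"])     # hypothetical

merged_df = market_data.merge(alternative_data, how="left",
                              left_index=True, right_index=True)
filled_df = merged_df.fillna(0)  # question 2: missing numbers become 0

# The sums match because .sum() skips NaN before filling.
assert filled_df.sum().sum() == merged_df.sum().sum()
```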

39
one_exercise_per_file/week01/day04/ex04/audit/readme.md

@ -1,6 +1,6 @@
The for loop is forbidden in this exercise. The goal is to use `groupby` and `apply`.
##### The exercise is validated if all questions of the exercise are validated and if the for loop hasn't been used. The goal is to use `groupby` and `apply`.
1. This question is validated if the output is:
##### The question 1 is validated if the output is:
```python
df = pd.DataFrame(range(1,11), columns=['sequence'])
@ -20,7 +20,22 @@ The for loop is forbidden in this exercise. The goal is to use `groupby` and `ap
| 8 | 8.2 |
| 9 | 8.2 |
2. This question is validated if the output is the same as the one returned by:
##### The question 2 is validated if the output is a Pandas Series or DataFrame with the first 11 rows equal to the output below. The code below gives a solution.
| | sequence |
|---:|-----------:|
| 0 | 1.45 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 9.55 |
| 10 | 11.45 |
```python
def winsorize(df_series, quantiles):
@ -38,22 +53,4 @@ The for loop is forbidden in this exercise. The goal is to use `groupby` and `ap
df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95])
```
The output can also be a Series instead of a DataFrame.
The expected output (first rows) is:
| | sequence |
|---:|-----------:|
| 0 | 1.45 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 9 | 9.55 |
| 10 | 11.45 |
- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e
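A hedged, self-contained sketch of the winsorization (the data is assumed: two groups holding 1..10 and 11..20, which matches the expected first rows 1.45 … 9.55, then 11.45):

```python
import pandas as pd

def winsorize(series, quantiles):
    # Clip a group's values to its own lower/upper quantiles.
    low, high = series.quantile(quantiles)
    return series.clip(low, high)

df = pd.DataFrame({
    "group": ["A"] * 10 + ["B"] * 10,
    "sequence": list(range(1, 21)),
})
result = df.groupby("group")["sequence"].apply(winsorize, quantiles=[0.05, 0.95])
```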

7
one_exercise_per_file/week01/day04/ex05/audit/readme.md

@@ -1,11 +1,8 @@
1. The question is validated if the output is:
##### The question is validated if the output is as below. The columns don't have to be MultiIndex. A solution could be `df.groupby('product').agg({'value':['min','max','mean']})`
| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
|:-------------|-------------------:|-------------------:|--------------------:|
| chair | 22.89 | 32.12 | 27.505 |
| mobile phone | 100 | 111.22 | 105.61 |
| table | 20.45 | 99.99 | 51.22 |
Note: The columns don't have to be MultiIndex
My answer is: `df.groupby('product').agg({'value':['min','max','mean']})`
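The suggested `agg` call can be sketched on toy data (not the exercise's actual dataset, so only the chair and mobile phone rows below reproduce the table's numbers):

```python
import pandas as pd

# Toy data with two rows per product
df = pd.DataFrame({
    'product': ['chair', 'chair', 'mobile phone', 'mobile phone', 'table', 'table'],
    'value': [22.89, 32.12, 100.0, 111.22, 20.45, 99.99],
})

# One aggregation function per statistic; the columns come out as a MultiIndex
stats = df.groupby('product').agg({'value': ['min', 'max', 'mean']})
print(round(stats.loc['chair', ('value', 'mean')], 3))  # 27.505
```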

7
one_exercise_per_file/week01/day04/ex06/audit/readme.md

@@ -1,4 +1,7 @@
1. This questions is validated is the output is similar to `unstacked_df.head()`:
##### The exercise is validated if all questions of the exercise are validated
The question 1 is validated if the output is similar to what `unstacked_df.head()` returns:
| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') |
|:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
@@ -6,4 +9,4 @@
| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
2. The question is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.
##### The question 2 is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.
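The unstack-then-plot pattern can be sketched with toy long-format data (the ticker list and random predictions below are illustrative, not the exercise's dataset):

```python
import pandas as pd
import numpy as np

# Toy long-format predictions: one row per (Date, Ticker) pair
idx = pd.MultiIndex.from_product(
    [pd.date_range('2021-01-04', periods=3, freq='D'), ['AAPL', 'AMZN']],
    names=['Date', 'Ticker'])
df = pd.DataFrame({'Prediction': np.random.randn(len(idx))}, index=idx)

# unstack pivots the innermost index level (Ticker) into columns
unstacked_df = df.unstack()
print(unstacked_df.columns.tolist())  # [('Prediction', 'AAPL'), ('Prediction', 'AMZN')]

# One line per ticker (needs matplotlib):
# unstacked_df.plot(title='Stocks 2021')
```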

12
one_exercise_per_file/week01/day05/ex01/audit/readme.md

@@ -1,4 +1,6 @@
1. This question is validated if the output of is
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output is as below. The best solution uses `pd.date_range` to generate the index and `range` to generate the integer series.
```console
2010-01-01 0
@@ -15,9 +17,7 @@
Freq: D, Name: integer_series, Length: 4018, dtype: int64
```
The best solution uses `pd.date_range` to generate the index and `range` to generate the integer series.
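The `pd.date_range` + `range` approach can be sketched as:

```python
import pandas as pd

# Daily dates from 2010-01-01 to 2020-12-31, paired with consecutive integers
index = pd.date_range('2010-01-01', '2020-12-31', freq='D')
integer_series = pd.Series(range(len(index)), index=index, name='integer_series')
print(len(integer_series))  # 4018
```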
2. This question is validated if the output is:
##### The question 2 is validated if the output is as below. If the `NaN` values have been dropped the solution is also accepted. The solution uses `rolling().mean()`.
```console
2010-01-01 NaN
@@ -32,6 +32,4 @@
2020-12-30 4013.0
2020-12-31 4014.0
Freq: D, Name: integer_series, Length: 4018, dtype: float64
```
If the `NaN` values have been dropped the solution is also accepted. The solution uses `rolling().mean()`.
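Assuming the integer series from question 1 and a 7-day window (which matches the six leading `NaN` values and the final 4014.0 in the expected output), the rolling mean can be sketched as:

```python
import pandas as pd

index = pd.date_range('2010-01-01', '2020-12-31', freq='D')
integer_series = pd.Series(range(len(index)), index=index, name='integer_series')

# 7-day rolling mean: the first 6 windows are incomplete, hence NaN
rolled = integer_series.rolling(7).mean()
print(rolled.iloc[-1])  # 4014.0
```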

23
one_exercise_per_file/week01/day05/ex02/audit/readme.md

@@ -1,17 +1,14 @@
Preliminary:
##### The exercise is validated if all questions of the exercise are validated
- As usual the first steps are:
###### Have you checked missing values and data types ?
###### Have you converted string dates to datetime ?
###### Have you set dates as index ?
###### Have you used `info` or `describe` to have a first look at the data ?
- Check missing values and data types
- Convert string dates to datetime
- Set dates as index
- Use `info` or `describe` to have a first look at the data
The exercise is not validated if these steps have not been done.
##### The question 1 is validated if you inserted the right columns in the `Plotly` `Candlestick` object. The Candlestick is based on the Open, High, Low and Close columns. The index is Date (datetime).
1. The Candlestick is based on Open, High, Low and Close columns. The index is Date (datetime). As long as you inserted the right columns in `Candlestick` `Plotly` object you validate the question.
2. This question is validated if the output of `print(transformed_df.head().to_markdown())` is
##### The question 2 is validated if the output of `print(transformed_df.head().to_markdown())` is as below and if there are **482 months**.
| Date | Open | Close | Volume | High | Low |
|:--------------------|---------:|---------:|------------:|---------:|---------:|
@@ -26,9 +23,8 @@ To get this result there are two ways: `resample` and `groupby`. There are two k
- Find how to affect the aggregation on the last **business** day of each month. This is already implemented in Pandas and the keyword that should be used either in `resample` parameter or in `Grouper` is `BM`.
- Choose the right aggregation function for each variable. The prices (Open, Close and Adjusted Close) should be aggregated by taking the `mean`. Low should be aggregated by taking the `minimum` because it represents the lowest price of the day, so the lowest price of the month is the lowest of the daily lows. The same logic applied to High leads to using the `maximum` to aggregate it. Volume should be aggregated using the `sum` because the monthly volume is equal to the sum of the daily volumes over the month.
There are **482 months**.
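The aggregation described above can be sketched on toy OHLCV data (a stand-in for the exercise's dataset), using the `BM` alias with `resample`:

```python
import pandas as pd
import numpy as np

# Toy daily OHLCV data standing in for the exercise's dataset
index = pd.date_range('2021-01-01', '2021-03-31', freq='B')
n = len(index)
df = pd.DataFrame({'Open': np.linspace(100, 110, n),
                   'High': np.linspace(101, 111, n),
                   'Low': np.linspace(99, 109, n),
                   'Close': np.linspace(100, 110, n),
                   'Volume': np.full(n, 1000)}, index=index)

# 'BM' bins on the last business day of each month; each column gets
# the aggregation that matches its meaning
monthly = df.resample('BM').agg({'Open': 'mean', 'High': 'max',
                                 'Low': 'min', 'Close': 'mean',
                                 'Volume': 'sum'})
print(len(monthly))  # 3
```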
3. The solution is accepted if it doesn't involve a for loop and the output is:
##### The question 3 is validated if it doesn't involve a for loop and the output is as below. The first way to compute the return without a for loop is to use `pct_change`. The second way is to implement the formula given in the exercise in a vectorized way; to get the value at `t-1` you can use `shift`.
```console
Date
@@ -45,6 +41,3 @@ To get this result there are two ways: `resample` and `groupby`. There are two k
2021-01-29 -0.026448
Name: Open, Length: 10118, dtype: float64
```
- The first way is to compute the return without for loop is to use `pct_change`
- The second way to compute the return without for loop is to implement the formula given in the exercise in a vectorized way. To get the value at `t-1` you can use `shift`
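Both loop-free approaches can be sketched side by side on a toy price series:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0], name='Open')

# Way 1: built-in pct_change
returns_a = prices.pct_change()

# Way 2: the formula r(t) = (p(t) - p(t-1)) / p(t-1), vectorized with shift
returns_b = (prices - prices.shift(1)) / prices.shift(1)

print(np.allclose(returns_a.dropna(), returns_b.dropna()))  # True
```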

3
one_exercise_per_file/week01/day05/ex03/audit/readme.md

@@ -1,7 +1,6 @@
1. This question is validated if, without having used a for loop, the outputted DataFrame shape's `(261, 5)` and your output is the same as the one return with this line of code:
##### This question is validated if, without having used a for loop, the outputted DataFrame's shape is `(261, 5)` and your output is the same as the one returned by this line of code. The DataFrame contains random data. Make sure your output and the one returned by this code are based on the same DataFrame.
```python
market_data.loc[market_data.index.get_level_values('Ticker')=='AAPL'].sort_index().pct_change()
```
The DataFrame contains random data. Make sure your output and the one returned by this code is based on the same DataFrame.
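The level-based selection can be sketched on a small MultiIndex DataFrame (toy tickers and random data, not the exercise's `market_data`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.MultiIndex.from_product(
    [['AAPL', 'FB'], pd.date_range('2021-01-01', periods=4, freq='D')],
    names=['Ticker', 'Date'])
market_data = pd.DataFrame(rng.random((len(index), 2)),
                           index=index, columns=['Open', 'Close'])

# Boolean mask on one index level, then sort the index before pct_change
aapl = market_data.loc[market_data.index.get_level_values('Ticker') == 'AAPL']
print(aapl.sort_index().pct_change().shape)  # (4, 2)
```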

29
one_exercise_per_file/week01/day05/ex04/audit/readme.md

@@ -1,17 +1,15 @@
Preliminary:
##### The exercise is validated if all questions of the exercise are validated
- As usual the first steps are:
###### Have you checked missing values and data types ?
###### Have you converted string dates to datetime ?
###### Have you set dates as index ?
###### Have you used `info` or `describe` to have a first look at the data ?
- Check missing values and data types
- Convert string dates to datetime
- Set dates as index
- Use `info` or `describe` to have a first look at the data
The exercise is not validated if these steps haven't been done.
My results can be reproduced using: `np.random.seed = 2712`. Given the versions of NumPy used I do not guaranty the reproducibility of the results - that is why I also explain the steps to get to the solution.
**My results can be reproduced using: `np.random.seed(2712)`. Given the versions of NumPy used I do not guarantee the reproducibility of the results - that is why I also explain the steps to get to the solution.**
1. This question is validated if the return is computed as: Return(t) = (Price(t+1) - Price(t))/Price(t) and returns this output.
##### The question 1 is validated if the return is computed as: Return(t) = (Price(t+1) - Price(t))/Price(t) and returns this output. Note that if the index is not ordered in ascending order the future return computed is wrong. The answer is also accepted if the return is computed as in exercise 2 and then shifted into the future using `shift`, but I do not recommend this implementation as it adds missing values!
```console
Date
@@ -29,8 +27,6 @@ My results can be reproduced using: `np.random.seed = 2712`. Given the versions
Name: Daily_futur_returns, Length: 10118, dtype: float64
```
The answer is also accepted if the returns is computed as in the exercise 2 and then shifted in the futur using `shift`, but I do not recommend this implementation as it adds missing values !
An example of solution is:
```python
@@ -40,11 +36,10 @@ My results can be reproduced using: `np.random.seed = 2712`. Given the versions
compute_futur_return(df['Adj Close'])
```
Note that if the index is not ordered in ascending order the futur return computed is wrong.
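A minimal sketch of the future-return computation (keeping the `compute_futur_return` name from the solution snippet above; the toy prices are illustrative):

```python
import pandas as pd

def compute_futur_return(price):
    # Return(t) = (Price(t+1) - Price(t)) / Price(t); shift(-1) pulls t+1 back to t
    return (price.shift(-1) - price) / price

prices = pd.Series([100.0, 110.0, 99.0], name='Adj Close')
futur = compute_futur_return(prices)
print(futur.iloc[0])  # 0.1
```

The last value is `NaN` because there is no next-day price to compute a future return from.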
2. This question is validated if the index of the Series is the same as the index of the DataFrame. The data of the series can be generated using `np.random.randint(0,2,len(df.index)`.
##### The question 2 is validated if the index of the Series is the same as the index of the DataFrame. The data of the series can be generated using `np.random.randint(0,2,len(df.index))`.
3. This question is validated if the Pnl is computed as: signal * futur_return. Both series should have the same index.
##### The question 3 is validated if the PnL is computed as: signal * futur_return. Both series should have the same index.
```console
Date
@@ -62,8 +57,6 @@ My results can be reproduced using: `np.random.seed = 2712`. Given the versions
Name: PnL, Length: 10119, dtype: float64
```
4. The question is validated if you computed the return of the strategy as: `(Total earned - Total invested) / Total` invested. The result should be close to 0. The formula given could be simplified as `(PnLs.sum())/signal.sum()`.
My return is: 0.00043546984088551553 because I invested 5147$ and I earned 5149$.
##### The question 4 is validated if you computed the return of the strategy as: `(Total earned - Total invested) / Total invested`. The result should be close to 0. The formula given could be simplified as `(PnLs.sum())/signal.sum()`. My return is: 0.00043546984088551553 because I invested 5147$ and I earned 5149$.
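The signal, PnL and strategy-return steps from questions 2-4 can be sketched together on deterministic toy data (the returns and signal below are illustrative, not the exercise's random series):

```python
import pandas as pd

# Toy future returns and a binary signal sharing the same index
dates = pd.date_range('2021-01-01', periods=5, freq='D')
futur_return = pd.Series([0.01, -0.02, 0.03, -0.01, 0.02], index=dates)
signal = pd.Series([1, 0, 1, 1, 0], index=dates)

# PnL: we earn the future return only on the days we invested 1$
pnl = signal * futur_return

# Strategy return: (total earned - total invested) / total invested,
# which simplifies to pnl.sum() / signal.sum()
strategy_return = pnl.sum() / signal.sum()
print(round(strategy_return, 10))  # 0.01
```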
5. The question is validated if you replaced the previous signal Series with 1s. Similarly as the previous question, we earned 10128$ and we invested 10118$ which leads to a return of 0.00112670194140969 (0.1%).
##### The question 5 is validated if you replaced the previous signal Series with 1s. Similarly to the previous question, we earned 10128$ and we invested 10118$ which leads to a return of 0.00112670194140969 (0.1%).