day4: testing and feedback

3 years ago · d8bda94310
1 changed files with 64 additions and 81 deletions
--- a/one_md_per_day_format/piscine/Week1/day4.md
+++ b/one_md_per_day_format/piscine/Week1/day4.md
@@ -1,29 +1,28 @@
-# D04  Piscine AI - Data Science 
+# D04  Piscine AI - Data Science

-
-Author: 
+Author:

 # Table of Contents:
-Historical part: 

-Data wrangling, unify source of data ... 
-# Introduction
+Historical part:

+Data wrangling, unify source of data ...

+# Introduction

-... 
-## Ressources 
-Pandas website
- https://jakevdp.github.io/PythonDataScienceHandbook/
+...

-https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
+## Resources

+Pandas website

-https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
+- https://jakevdp.github.io/PythonDataScienceHandbook/

-https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
+- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

+- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

+- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe

 # Exercise 1 Concatenate

@@ -31,18 +30,15 @@ The goal of this exercise is to learn to concatenate DataFrames. The logic is th

 Here are the two DataFrames to concatenate:

-
-```
+```python
 df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
 df2 = pd.DataFrame([['c', 1], ['d', 2]],
                   columns=['letter', 'number'])
-
 ```

 1. Concatenate these two DataFrames along the index axis and reset the index. The index of the output DataFrame should be `RangeIndex(start=0, stop=4, step=1)`. **Do not change the index manually**.

-
 ## Correction

 1. This question is validated if the output DataFrame is:
@@ -54,15 +50,14 @@ df2 = pd.DataFrame([['c', 1], ['d', 2]],
    |  2 | c        |        1 |
    |  3 | d        |        2 |
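
A minimal sketch of one way to reach this output, assuming `pd.concat` with `ignore_index=True` is acceptable (chaining `.reset_index(drop=True)` after a plain `pd.concat` should give the same result):

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]],
                   columns=['letter', 'number'])

# axis=0 stacks the rows; ignore_index=True drops the original 0,1,0,1 labels
# and builds a fresh RangeIndex, so the index is never edited by hand.
result = pd.concat([df1, df2], axis=0, ignore_index=True)
print(result.index)
```

The point of the exercise is that the index is rebuilt by Pandas rather than assigned manually.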

-
 # Exercise 2 Merge

 The goal of this exercise is to learn to merge DataFrames
-The logic of merging DataFrames in Pandas is quite similar as the one used in SQL. 
+The logic of merging DataFrames in Pandas is quite similar to the one used in SQL.

 Here are the two DataFrames to merge:

-```
+```python
 #df1

 df1_dict = {
@@ -80,6 +75,7 @@ df2_dict = {

 df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
 ```
+
 1. Merge the two DataFrames to get this output:

    |    |   id | Feature1_x   | Feature2_x   | Feature1_y   | Feature2_y   |
@@ -87,7 +83,7 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
    |  0 |    1 | A            | B            | K            | L            |
    |  1 |    2 | C            | D            | M            | N            |

-2. Merge the two DataFrames to get this output: 
+2. Merge the two DataFrames to get this output:

    |    |   id | Feature1_df1   | Feature2_df1   | Feature1_df2   | Feature2_df2   |
    |---:|-----:|:---------------|:---------------|:---------------|:---------------|
@@ -100,16 +96,16 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
    |  6 |    7 | nan            | nan            | Q              | R              |
    |  7 |    8 | nan            | nan            | S              | T              |

-## Correction 
+## Correction

-1. This question is validated if the output is: 
+1. This question is validated if the output is:

    |    |   id | Feature1_x   | Feature2_x   | Feature1_y   | Feature2_y   |
    |---:|-----:|:-------------|:-------------|:-------------|:-------------|
    |  0 |    1 | A            | B            | K            | L            |
    |  1 |    2 | C            | D            | M            | N            |

-2. This question is validated if the output is: 
+2. This question is validated if the output is:

    |    |   id | Feature1_df1   | Feature2_df1   | Feature1_df2   | Feature2_df2   |
    |---:|-----:|:---------------|:---------------|:---------------|:---------------|
@@ -122,17 +118,16 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
    |  6 |    7 | nan            | nan            | Q              | R              |
    |  7 |    8 | nan            | nan            | S              | T              |

-    Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name. 
-
+    Note: Check that the suffixes are set using the `suffixes` parameter rather than by manually renaming the columns.
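
As an illustration of the two join types and of the `suffixes` parameter, here is a small sketch. The frames `left` and `right` are hypothetical stand-ins, not the exercise's `df1` and `df2`:

```python
import pandas as pd

# Hypothetical data: ids 1 and 2 are shared, 3 and 4 appear on one side only.
left = pd.DataFrame({'id': [1, 2, 3], 'Feature1': ['A', 'C', 'E']})
right = pd.DataFrame({'id': [1, 2, 4], 'Feature1': ['K', 'M', 'Q']})

# Inner join (the default): keeps only shared ids; overlapping column
# names receive the default suffixes '_x' and '_y'.
inner = left.merge(right, on='id')

# Outer join: keeps every id from both sides and fills the gaps with NaN;
# suffixes= names the overlapping columns explicitly.
outer = left.merge(right, on='id', how='outer', suffixes=('_df1', '_df2'))

print(inner)
print(outer)
```

Setting the suffixes inside `merge` keeps the renaming tied to the join itself, which is what the note above asks the corrector to check.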

 # Exercise 3 Merge MultiIndex

-The goal of this exercise is to learn to merge DataFrames with MultiIndex. 
-Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reasons the Data Engineer lost the last 15 days of alternative data. 
+The goal of this exercise is to learn to merge DataFrames with MultiIndex.
+Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, market data is only available on trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. However, for some reason the Data Engineer lost the last 15 days of alternative data.

 1. Using `market_data` as the reference, merge `alternative_data` on `market_data`

-    ```
+    ```python
    #generate days
    all_dates = pd.date_range('2021-01-01', '2021-12-15')
    business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
@@ -152,20 +147,17 @@ Use the code below to generate the DataFrames. `market_data` contains fake marke
    alternative_data = pd.DataFrame(index=index_alt,
                                    data=np.random.randn(len(index_alt), 2),
                                    columns=['Twitter','Reddit'])
-
    ```

 `reset_index` is not allowed for this question

-2. Fill missing values with 0 
-
-https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
+2. Fill missing values with 0

+- https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d

+## Correction

-## Correction 
-
-1. This question is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns:
+1. This question is validated if the output DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a similar table:

 |                                                      |       Open |     Close |   Close_Adjusted |     Twitter |    Reddit |
 |:-----------------------------------------------------|-----------:|----------:|-----------------:|------------:|----------:|
@@ -175,23 +167,21 @@ https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
 | (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') |  1.06324   |  0.841241 |        -0.799481 | -0.805677   |  0.511769 |
 | (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI')  | -0.603453  | -2.06141  |        -0.969064 |  1.49817    |  0.730055 |

-One of the answers that returns the correct DataFrame is: 
+One of the answers that returns the correct DataFrame is:

 `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`

-2. This question is validated if the number of missing in the DataFrame is equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
-
+2. This question is validated if the number of missing values in the DataFrame is 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` returns `True`.
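
The sketch below reproduces both checks on a miniature data set. The frames `market` and `alt`, their values, and the level names `Date` and `Ticker` are hypothetical and much smaller than the exercise's data:

```python
import pandas as pd

# Hypothetical market data: two business days, two tickers.
market_idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2021-01-04', '2021-01-05']), ['AAPL', 'FB']],
    names=['Date', 'Ticker'])
market = pd.DataFrame({'Close': [1.0, 2.0, 3.0, 4.0]}, index=market_idx)

# Hypothetical alternative data: covers a Sunday but misses 2021-01-05.
alt_idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2021-01-03', '2021-01-04']), ['AAPL', 'FB']],
    names=['Date', 'Ticker'])
alt = pd.DataFrame({'Twitter': [0.1, 0.2, 0.3, 0.4]}, index=alt_idx)

# Left join on both index levels: every market row is kept,
# and unmatched rows get NaN in the Twitter column.
merged = market.merge(alt, how='left', left_index=True, right_index=True)

# Replace the NaN values created by the join with 0.
filled = merged.fillna(0)

print(merged)
print(filled)
```

Because `sum()` skips NaN by default, filling with 0 leaves the grand total unchanged, which is why the second check compares `filled_df.sum().sum()` with `merged_df.sum().sum()`.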

 # Exercise 4 Groupby Apply

-The goal of this exercise is to learn to group the data and apply a function on the groups. 
-The use case we will work on is computing 
+The goal of this exercise is to learn to group the data and apply a function on the groups.
+The use case we will work on is computing

-1. Create a function that uses `pandas.DataFrame.clip` and that replace extreme values by a given percentile. The values that are greater than the upper percentile 80% are replaced by the percentile 80%. The values that are smaller than the lower percentile 20% are replaced by the percentile 20%. This process that correct outliers is called **winsorizing**. 
-I recommend to use NumPy to compute the percentiles to make sure we used the same defaut parameters. 
+1. Create a function that uses `pandas.DataFrame.clip` to replace extreme values with a given percentile. Values greater than the 80th percentile are replaced by the 80th percentile, and values smaller than the 20th percentile are replaced by the 20th percentile. This process of correcting outliers is called **winsorizing**.
+I recommend using NumPy to compute the percentiles to make sure we use the same default parameters.

-
-    ```
+    ```python
        def winsorize(df, quantiles):
            """
                df: pd.DataFrame
@@ -201,15 +191,15 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
            #TODO
            return 
    ```
-    Here is what the function should output: 

-    ```
+    Here is what the function should output:
+
+    ```python
        df = pd.DataFrame(range(1,11), columns=['sequence'])
        print(winsorize(df, [0.20, 0.80]).to_markdown())

    ```

-
    |    |   sequence |
    |---:|-----------:|
    |  0 |        2.8 |
@@ -223,16 +213,16 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
    |  8 |        8.2 |
    |  9 |        8.2 |

+2. Now we consider that each value belongs to a group. The goal is to apply the **winsorizing to each group**. In this question we use a common choice of percentiles: `[0.05, 0.95]`. Here is the new data set:

-2. Now we consider that each value belongs to a group. The goal is to apply the **winsorizing to each group**. In this question we use winsorizing values that are common: `[0.05,0.95]` as percentiles. Here is the new data set: 
-
-    ```
+    ```python
    groups = np.concatenate([np.ones(10), np.ones(10)+1,  np.ones(10)+2, np.ones(10)+3, np.ones(10)+4])
    
    df = pd.DataFrame(data= zip(groups,
                                range(1,51)),
                    columns=["group", "sequence"])
    ```
+
    The expected output (first rows) is:

    |    |   sequence |
@@ -249,19 +239,17 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
    |  9 |       9.55 |
    | 10 |      11.45 |

-
 ## Correction
-The for loop is forbidden in this exercise. The goal is to use `groupby` and `apply`. 

-1.  This question is validated if the output is:
+`for` loops are forbidden in this exercise. The goal is to use `groupby` and `apply`.

-    ```
+1. This question is validated if the output is:
+
+    ```python
        df = pd.DataFrame(range(1,11), columns=['sequence'])
        print(winsorize(df, [0.20, 0.80]).to_markdown())
-
    ```

-
    |    |   sequence |
    |---:|-----------:|
    |  0 |        2.8 |
@@ -275,10 +263,9 @@ The for loop is forbidden in this exercise. The goal is to use `groupby` and `ap
    |  8 |        8.2 |
    |  9 |        8.2 |

+2. This question is validated if the output is the same as the one returned by:

-2. This question is validated if the output is the same as the one returned by: 
-
-    ```
+    ```python
    def winsorize(df_series, quantiles):
    """
        df: pd.DataFrame or pd.Series
@@ -293,7 +280,8 @@ The for loop is forbidden in this exercise. The goal is to use `groupby` and `ap

    df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95])
    ```
-    The ouput can also be a Series instead of a DataFrame.
+
+    The output can also be a Series instead of a DataFrame.

    The expected output (first rows) is:

@@ -309,15 +297,13 @@ The for loop is forbidden in this exercise. The goal is to use `groupby` and `ap
    |  7 |       8    |
    |  8 |       9    |
    |  9 |       9.55 |
-    | 10 |      11.45 | 
-
-https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e
-
+    | 10 |      11.45 |

+- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e
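
For reference, a possible shape of the solution is sketched below. It is not necessarily the reference implementation (its body is not shown in this diff): it assumes the function receives a Series, uses `np.quantile` with its default interpolation, and passes `group_keys=False` so the result keeps the original row index.

```python
import numpy as np
import pandas as pd

def winsorize(series, quantiles):
    """Clip a Series to the values found at the two given quantiles."""
    low, high = np.quantile(series, quantiles)
    return series.clip(lower=low, upper=high)

# Question 1: a single sequence, clipped at the 20th and 80th percentiles.
df = pd.DataFrame(range(1, 11), columns=['sequence'])
single = winsorize(df['sequence'], [0.20, 0.80])

# Question 2: five groups of ten values, winsorized group by group.
groups = np.repeat(np.arange(1, 6), 10)
grouped_df = pd.DataFrame({'group': groups, 'sequence': range(1, 51)})
per_group = (grouped_df.groupby('group', group_keys=False)['sequence']
             .apply(winsorize, [0.05, 0.95]))

print(single.head(3))
print(per_group.head(11))
```

Extra positional arguments given to `apply` are forwarded to the function, so the quantiles can be passed without a lambda and without a loop over the groups.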

 # Exercise 5 Groupby Agg

-The goal of this exercise is to learn to compute different type of agregations on the groups. This small DataFrame contains products and prices. 
+The goal of this exercise is to learn to compute different types of aggregations on the groups. This small DataFrame contains products and prices.

 |    |   value | product      |
 |---:|--------:|:-------------|
@@ -329,7 +315,7 @@ The goal of this exercise is to learn to compute different type of agregations o
 |  5 |  100    | mobile phone |
 |  6 |   99.99 | table        |

-1. Compute the min, max and mean price for each product in one single line of code. The expected output is: 
+1. Compute the min, max and mean price for each product in a single line of code. The expected output is:

 | product      |   ('value', 'min') |   ('value', 'max') |   ('value', 'mean') |
 |:-------------|-------------------:|-------------------:|--------------------:|
@@ -341,7 +327,7 @@ Note: The columns don't have to be MultiIndex

 ## Correction

-1. The question is validated if the output is: 
+1. The question is validated if the output is:

 | product      |   ('value', 'min') |   ('value', 'max') |   ('value', 'mean') |
 |:-------------|-------------------:|-------------------:|--------------------:|
@@ -353,12 +339,12 @@ Note: The columns don't have to be MultiIndex

 My answer is: `df.groupby('product').agg({'value':['min','max','mean']})`
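
Since the columns do not have to be a MultiIndex, the sketch below contrasts the dictionary form with a flat-column variant. The data is hypothetical, not the exercise's product table:

```python
import pandas as pd

# Hypothetical prices, two rows per product.
df = pd.DataFrame({
    'value': [10.0, 30.0, 5.0, 15.0],
    'product': ['chair', 'chair', 'table', 'table'],
})

# Dictionary form: the columns become a MultiIndex such as ('value', 'min').
multi = df.groupby('product').agg({'value': ['min', 'max', 'mean']})

# Selecting the column first gives flat column names: 'min', 'max', 'mean'.
flat = df.groupby('product')['value'].agg(['min', 'max', 'mean'])

print(multi)
print(flat)
```

Both forms compute the same numbers; only the column labels differ.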

-# Exercise 6 Unstack 
+# Exercise 6 Unstack

-The goal of this exercise is to learn to unstack a MultiIndex. 
-Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest etc ... 
+The goal of this exercise is to learn to unstack a MultiIndex.
+Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. Unstacking the MultiIndex can be very useful, for example to plot the time series or to vectorize the backtest.

-```
+```python
 business_dates = pd.bdate_range('2021-01-01', '2021-12-31')

 #generate tickers
@@ -373,7 +359,8 @@ market_data = pd.DataFrame(index=index,
                        columns=['Prediction'])

 ```
-1. Unstack the DataFrame. 
+
+1. Unstack the DataFrame.

 The first 3 rows of the DataFrame should look like this:

@@ -383,13 +370,11 @@ The first 3 rows of the DataFrame should like this:
 | 2021-01-04 00:00:00 |                -0.560953 |                 0.503199 |               -0.79517  |             -3.23136   |                1.50271 |
 | 2021-01-05 00:00:00 |                 0.211489 |                 1.84867  |                0.287906 |             -1.81119   |                1.20321 |

+2. Plot the 5 time series on the same plot, with a title, using Pandas built-in visualization functions.

-2. Plot the 5 times series in the same plot using Pandas built-in visualisation functions with a title. 
-
-
-## Correction 
+## Correction

-1. This questions is validated is the output of `unstacked_df.head()` is 
+1. This question is validated if the output of `unstacked_df.head()` is similar to:

    | Date                |   ('Prediction', 'AAPL') |   ('Prediction', 'AMZN') |   ('Prediction', 'DAI') |   ('Prediction', 'FB') |   ('Prediction', 'GE') |
    |:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
@@ -397,6 +382,4 @@ The first 3 rows of the DataFrame should like this:
    | 2021-01-04 00:00:00 |                -0.560953 |                 0.503199 |               -0.79517  |             -3.23136   |                1.50271 |
    | 2021-01-05 00:00:00 |                 0.211489 |                 1.84867  |                0.287906 |             -1.81119   |                1.20321 |

-2. The question is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else. 
-
-
+2. The question is validated if the answer is: `unstacked_df.plot(title='Stocks 2021')`. Any other title is accepted.
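
To make the reshaping concrete, here is a small sketch of `unstack` on a two-level index. The dates, tickers, values and the level names `Date` and `Ticker` are hypothetical:

```python
import pandas as pd

# Hypothetical predictions: three business days, two tickers.
dates = pd.bdate_range('2021-01-04', '2021-01-06')
index = pd.MultiIndex.from_product([dates, ['AAPL', 'FB']],
                                   names=['Date', 'Ticker'])
stacked = pd.DataFrame({'Prediction': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]},
                       index=index)

# unstack() moves the innermost index level (Ticker) into the columns,
# leaving one row per date and one column per ticker.
unstacked = stacked.unstack()
print(unstacked)

# With matplotlib installed, all series can then be drawn on one plot:
# unstacked.plot(title='Stocks 2021')
```

After unstacking, each ticker is its own column, so a single `plot` call draws one line per ticker against the date index.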