diff --git a/one_md_per_day_format/piscine/Week1/day4.md b/one_md_per_day_format/piscine/Week1/day4.md
index 2ff98d9..22d9b78 100644
--- a/one_md_per_day_format/piscine/Week1/day4.md
+++ b/one_md_per_day_format/piscine/Week1/day4.md
@@ -1,29 +1,28 @@
-# D04 Piscine AI - Data Science
+# D04 Piscine AI - Data Science
-
-Author:
+Author:
 # Table of Contents:
-Historical part:
-Data wrangling, unify source of data ...
-# Introduction
+Historical part:
+Data wrangling, unify source of data ...
+# Introduction
-...
-## Ressources
-Pandas website
-- https://jakevdp.github.io/PythonDataScienceHandbook/
+...
-https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
+## Resources
+Pandas website
-https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
+- https://jakevdp.github.io/PythonDataScienceHandbook/
-https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
+- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
+- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
+- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
 # Exercise 1 Concatenate
@@ -31,18 +30,15 @@ The goal of this exercise is to learn to concatenate DataFrames. The logic is th
 Here are the two DataFrames to concatenate:
-
-```
+```python
 df1 = pd.DataFrame([['a', 1], ['b', 2]],
                    columns=['letter', 'number'])
 df2 = pd.DataFrame([['c', 1], ['d', 2]],
                    columns=['letter', 'number'])
-
 ```
 1. Concatenate this two DataFrames on index axis and reset the index. The index of the outputted should be `RangeIndex(start=0, stop=4, step=1)`. **Do not change the index manually**.
-
 ## Correction
 1. This question is validated if the outputted DataFrame is:
@@ -54,15 +50,14 @@ df2 = pd.DataFrame([['c', 1], ['d', 2]],
 | 2 | c | 1 |
 | 3 | d | 2 |
-
 # Exercise 2 Merge
 The goal of this exercise is to learn to merge DataFrames
-The logic of merging DataFrames in Pandas is quite similar as the one used in SQL.
+The logic of merging DataFrames in Pandas is quite similar as the one used in SQL.
 Here are the two DataFrames to merge:
-```
+```python
 #df1
 df1_dict = {
@@ -80,6 +75,7 @@ df2_dict = {
 df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
 ```
+
 1. Merge the two DataFrames to get this output:
 | | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
@@ -87,7 +83,7 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
 | 0 | 1 | A | B | K | L |
 | 1 | 2 | C | D | M | N |
-2. Merge the two DataFrames to get this output:
+2. Merge the two DataFrames to get this output:
 | | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
 |---:|-----:|:---------------|:---------------|:---------------|:---------------|
@@ -100,16 +96,16 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
 | 6 | 7 | nan | nan | Q | R |
 | 7 | 8 | nan | nan | S | T |
-## Correction
+## Corrections
-1. This question is validated if the output is:
+1. This question is validated if the output is:
 | | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y |
 |---:|-----:|:-------------|:-------------|:-------------|:-------------|
 | 0 | 1 | A | B | K | L |
 | 1 | 2 | C | D | M | N |
-2. This question is validated if the output is:
+2. This question is validated if the output is:
 | | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 |
 |---:|-----:|:---------------|:---------------|:---------------|:---------------|
@@ -122,17 +118,16 @@ df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2'])
 | 6 | 7 | nan | nan | Q | R |
 | 7 | 8 | nan | nan | S | T |
-    Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name.
-
+    Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name.
 ## Exercise 3 Merge MultiIndex
-The goal of this exercise is to learn to merge DataFrames with MultiIndex.
-Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reasons the Data Engineer lost the last 15 days of alternative data.
+The goal of this exercise is to learn to merge DataFrames with MultiIndex.
+Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reasons the Data Engineer lost the last 15 days of alternative data.
 1. Using `market_data` as the reference, merge `alternative_data` on `market_data`
-    ```
+    ```python
     #generate days
     all_dates = pd.date_range('2021-01-01', '2021-12-15')
     business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
@@ -152,20 +147,17 @@ Use the code below to generate the DataFrames. `market_data` contains fake marke
     alternative_data = pd.DataFrame(index=index_alt, data=np.random.randn(len(index_alt), 2), columns=['Twitter','Reddit'])
-
     ```
     `reset_index` is not allowed for this question
-2. Fill missing values with 0
-
-https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
+2. Fill missing values with 0
+- https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
+## Correction
-## Correction
-
-1. This question is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns:
+1. This question is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a similar table:
 | | Open | Close | Close_Adjusted | Twitter | Reddit |
 |:-----------------------------------------------------|-----------:|----------:|-----------------:|------------:|----------:|
@@ -175,23 +167,21 @@ https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
 | (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 |
 | (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 |
-One of the answers that returns the correct DataFrame is:
+One of the answers that returns the correct DataFrame is:
 `market_data.merge(alternative_data, how='left', left_index=True, right_index=True)`
-2. This question is validated if the number of missing in the DataFrame is equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
-
+2. This question is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
 # Exercise 4 Groupby Apply
-The goal of this exercise is to learn to group the data and apply a function on the groups.
-The use case we will work on is computing
+The goal of this exercise is to learn to group the data and apply a function on the groups.
+The use case we will work on is computing
-1. Create a function that uses `pandas.DataFrame.clip` and that replace extreme values by a given percentile. The values that are greater than the upper percentile 80% are replaced by the percentile 80%. The values that are smaller than the lower percentile 20% are replaced by the percentile 20%. This process that correct outliers is called **winsorizing**.
-I recommend to use NumPy to compute the percentiles to make sure we used the same defaut parameters.
+1. Create a function that uses `pandas.DataFrame.clip` and that replace extreme values by a given percentile. The values that are greater than the upper percentile 80% are replaced by the percentile 80%. The values that are smaller than the lower percentile 20% are replaced by the percentile 20%. This process that correct outliers is called **winsorizing**.
+I recommend to use NumPy to compute the percentiles to make sure we used the same default parameters.
-
-    ```
+    ```python
     def winsorize(df, quantiles):
         """
             df: pd.DataFrame
@@ -201,15 +191,15 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
         #TODO
         return
     ```
-    Here is what the function should output:
-    ```
+    Here is what the function should output:
+
+    ```python
     df = pd.DataFrame(range(1,11), columns=['sequence'])
     print(winsorize(df, [0.20, 0.80]).to_markdown())
     ```
-
     | | sequence |
     |---:|-----------:|
     | 0 | 2.8 |
@@ -223,16 +213,16 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
     | 8 | 8.2 |
     | 9 | 8.2 |
+2. Now we consider that each value belongs to a group. The goal is to apply the **winsorizing to each group**. In this question we use winsorizing values that are common: `[0.05,0.95]` as percentiles. Here is the new data set:
-2. Now we consider that each value belongs to a group. The goal is to apply the **winsorizing to each group**. In this question we use winsorizing values that are common: `[0.05,0.95]` as percentiles. Here is the new data set:
-
-    ```
+    ```python
     groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4])
     df = pd.DataFrame(data= zip(groups, range(1,51)), columns=["group", "sequence"])
     ```
+
     The expected output (first rows) is:
     | | sequence |
@@ -249,19 +239,17 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
     | 9 | 9.55 |
     | 10 | 11.45 |
-
 ## Correction
-The for loop is forbidden in this exercise. The goal is to use `groupby` and `apply`.
-1. This question is validated if the output is:
+The for loop is forbidden in this exercise. The goal is to use `groupby` and `apply`.
-    ```
+1. This question is validated if the output is:
+
+    ```python
     df = pd.DataFrame(range(1,11), columns=['sequence'])
     print(winsorize(df, [0.20, 0.80]).to_markdown())
-
     ```
-
     | | sequence |
     |---:|-----------:|
     | 0 | 2.8 |
@@ -275,10 +263,9 @@ The for loop is forbidden in this exercise. The goal is to use `groupby` and `ap
     | 8 | 8.2 |
     | 9 | 8.2 |
+2. This question is validated if the output is the same as the one returned by:
-2. This question is validated if the output is the same as the one returned by:
-
-    ```
+    ```python
     def winsorize(df_series, quantiles):
         """
             df: pd.DataFrame or pd.Series
@@ -293,7 +280,8 @@ The for loop is forbidden in this exercise. The goal is to use `groupby` and `ap
     df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95])
     ```
-    The ouput can also be a Series instead of a DataFrame.
+
+    The output can also be a Series instead of a DataFrame.
     The expected output (first rows) is:
@@ -309,15 +297,13 @@ The for loop is forbidden in this exercise. The goal is to use `groupby` and `ap
     | 7 | 8 |
     | 8 | 9 |
     | 9 | 9.55 |
-    | 10 | 11.45 |
-
-https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e
-
+    | 10 | 11.45 |
+- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e
 # Exercise 5 Groupby Agg
-The goal of this exercise is to learn to compute different type of agregations on the groups. This small DataFrame contains products and prices.
+The goal of this exercise is to learn to compute different type of aggregations on the groups. This small DataFrame contains products and prices.
 | | value | product |
 |---:|--------:|:-------------|
@@ -329,7 +315,7 @@ The goal of this exercise is to learn to compute different type of agregations o
 | 5 | 100 | mobile phone |
 | 6 | 99.99 | table |
-1. Compute the min, max and mean price for each product in one single line of code. The expected output is:
+1. Compute the min, max and mean price for each product in one single line of code. The expected output is:
 | product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
 |:-------------|-------------------:|-------------------:|--------------------:|
@@ -341,7 +327,7 @@ Note: The columns don't have to be MultiIndex
 ## Correction
-1. The question is validated if the output is:
+1. The question is validated if the output is:
 | product | ('value', 'min') | ('value', 'max') | ('value', 'mean') |
 |:-------------|-------------------:|-------------------:|--------------------:|
@@ -353,12 +339,12 @@ Note: The columns don't have to be MultiIndex
 My answer is: `df.groupby('product').agg({'value':['min','max','mean']})`
-# Exercise 6 Unstack
+# Exercise 6 Unstack
-The goal of this exercise is to learn to unstack a MultiIndex.
-Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest etc ...
+The goal of this exercise is to learn to unstack a MultiIndex
+Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, ...
-```
+```python
 business_dates = pd.bdate_range('2021-01-01', '2021-12-31')
 #generate tickers
@@ -373,7 +359,8 @@ market_data = pd.DataFrame(index=index,
                            columns=['Prediction'])
 ```
-1. Unstack the DataFrame.
+
+1. Unstack the DataFrame.
 The first 3 rows of the DataFrame should like this:
@@ -383,13 +370,11 @@ The first 3 rows of the DataFrame should like this:
 | 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
 | 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
+2. Plot the 5 times series in the same plot using Pandas built-in visualization functions with a title.
-2. Plot the 5 times series in the same plot using Pandas built-in visualisation functions with a title.
-
-
-## Correction
+## Correction
-1. This questions is validated is the output of `unstacked_df.head()` is
+1. This questions is validated is the output is similar to `unstacked_df.head()`:
 | Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') |
 |:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:|
@@ -397,6 +382,4 @@ The first 3 rows of the DataFrame should like this:
 | 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 |
 | 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 |
-2. The question is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.
-
-
+2. The question is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else.
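
Review note on Exercise 6: the setup code is elided by the hunk boundaries, so the following is only a minimal sketch of the unstack step the correction expects. It assumes a hypothetical, much smaller frame shaped like `market_data` (business dates by tickers, one `Prediction` column). The date range, ticker subset, index level names and random values are illustrative assumptions, not the exercise's data.

```python
import numpy as np
import pandas as pd

# Assumption: a tiny stand-in for the exercise's `market_data`
# (5 business days x 3 tickers, random values from a seeded generator).
business_dates = pd.bdate_range('2021-01-04', '2021-01-08')
tickers = ['AAPL', 'FB', 'GE']
index = pd.MultiIndex.from_product([business_dates, tickers],
                                   names=['Date', 'Ticker'])
rng = np.random.default_rng(0)
market_data = pd.DataFrame(index=index,
                           data=rng.standard_normal((len(index), 1)),
                           columns=['Prediction'])

# unstack() moves the innermost index level (here 'Ticker') into the columns,
# leaving one row per date and one ('Prediction', ticker) column per ticker.
unstacked = market_data.unstack()
print(unstacked.shape)

# With matplotlib installed, the plotting answer quoted in the correction
# would then apply directly:
# unstacked.plot(title='Stocks 2021')
```

The values themselves are unchanged by the reshape: each cell of `unstacked` is the `Prediction` value of the matching `(Date, Ticker)` pair in `market_data`.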