From 9e8fe16f0c58defa91ffe451da95d4c40d1da62e Mon Sep 17 00:00:00 2001 From: Badr Ghazlane Date: Sat, 25 Sep 2021 17:43:54 +0200 Subject: [PATCH] feat: structure exercises of day04 --- .../day04/ex01/audit/readme.md | 10 +++ one_exercise_per_file/day04/ex01/readme.md | 14 ++++ .../day04/ex02/audit/readme.md | 21 ++++++ one_exercise_per_file/day04/ex02/readme.md | 46 +++++++++++++ .../day04/ex03/audit/readme.md | 15 +++++ one_exercise_per_file/day04/ex03/readme.md | 34 ++++++++++ .../day04/ex04/audit/readme.md | 59 +++++++++++++++++ one_exercise_per_file/day04/ex04/readme.md | 65 +++++++++++++++++++ .../day04/ex05/audit/readme.md | 11 ++++ one_exercise_per_file/day04/ex05/readme.md | 23 +++++++ .../day04/ex06/audit/readme.md | 9 +++ one_exercise_per_file/day04/ex06/readme.md | 32 +++++++++ one_exercise_per_file/day04/readme.md | 25 +++++++ 13 files changed, 364 insertions(+) create mode 100644 one_exercise_per_file/day04/ex01/audit/readme.md create mode 100644 one_exercise_per_file/day04/ex01/readme.md create mode 100644 one_exercise_per_file/day04/ex02/audit/readme.md create mode 100644 one_exercise_per_file/day04/ex02/readme.md create mode 100644 one_exercise_per_file/day04/ex03/audit/readme.md create mode 100644 one_exercise_per_file/day04/ex03/readme.md create mode 100644 one_exercise_per_file/day04/ex04/audit/readme.md create mode 100644 one_exercise_per_file/day04/ex04/readme.md create mode 100644 one_exercise_per_file/day04/ex05/audit/readme.md create mode 100644 one_exercise_per_file/day04/ex05/readme.md create mode 100644 one_exercise_per_file/day04/ex06/audit/readme.md create mode 100644 one_exercise_per_file/day04/ex06/readme.md create mode 100644 one_exercise_per_file/day04/readme.md diff --git a/one_exercise_per_file/day04/ex01/audit/readme.md b/one_exercise_per_file/day04/ex01/audit/readme.md new file mode 100644 index 0000000..7e17988 --- /dev/null +++ b/one_exercise_per_file/day04/ex01/audit/readme.md @@ -0,0 +1,10 @@ +## Correction + +1. This question is validated if the outputted DataFrame is: + + | | letter | number | + |---:|:---------|---------:| + | 0 | a | 1 | + | 1 | b | 2 | + | 2 | c | 1 | + | 3 | d | 2 | diff --git a/one_exercise_per_file/day04/ex01/readme.md b/one_exercise_per_file/day04/ex01/readme.md new file mode 100644 index 0000000..69ca9bc --- /dev/null +++ b/one_exercise_per_file/day04/ex01/readme.md @@ -0,0 +1,14 @@ +# Exercise 1 Concatenate + +The goal of this exercise is to learn to concatenate DataFrames. The logic is the same for the Series. + +Here are the two DataFrames to concatenate: + +```python +df1 = pd.DataFrame([['a', 1], ['b', 2]], + columns=['letter', 'number']) +df2 = pd.DataFrame([['c', 1], ['d', 2]], + columns=['letter', 'number']) +``` + +1. Concatenate this two DataFrames on index axis and reset the index. The index of the outputted should be `RangeIndex(start=0, stop=4, step=1)`. **Do not change the index manually**. diff --git a/one_exercise_per_file/day04/ex02/audit/readme.md b/one_exercise_per_file/day04/ex02/audit/readme.md new file mode 100644 index 0000000..b0fae88 --- /dev/null +++ b/one_exercise_per_file/day04/ex02/audit/readme.md @@ -0,0 +1,21 @@ +1. This question is validated if the output is: + + | | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | + |---:|-----:|:-------------|:-------------|:-------------|:-------------| + | 0 | 1 | A | B | K | L | + | 1 | 2 | C | D | M | N | + +2. This question is validated if the output is: + + | | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 | + |---:|-----:|:---------------|:---------------|:---------------|:---------------| + | 0 | 1 | A | B | K | L | + | 1 | 2 | C | D | M | N | + | 2 | 3 | E | F | nan | nan | + | 3 | 4 | G | H | nan | nan | + | 4 | 5 | I | J | nan | nan | + | 5 | 6 | nan | nan | O | P | + | 6 | 7 | nan | nan | Q | R | + | 7 | 8 | nan | nan | S | T | + + Note: Check that the suffixes are set using the suffix parameters rather than manually changing the columns' name. diff --git a/one_exercise_per_file/day04/ex02/readme.md b/one_exercise_per_file/day04/ex02/readme.md new file mode 100644 index 0000000..3871671 --- /dev/null +++ b/one_exercise_per_file/day04/ex02/readme.md @@ -0,0 +1,46 @@ +# Exercise 2 Merge + +The goal of this exercise is to learn to merge DataFrames +The logic of merging DataFrames in Pandas is quite similar as the one used in SQL. + +Here are the two DataFrames to merge: + +```python +#df1 + +df1_dict = { + 'id': ['1', '2', '3', '4', '5'], + 'Feature1': ['A', 'C', 'E', 'G', 'I'], + 'Feature2': ['B', 'D', 'F', 'H', 'J']} + +df1 = pd.DataFrame(df1_dict, columns = ['id', 'Feature1', 'Feature2']) + +#df2 +df2_dict = { + 'id': ['1', '2', '6', '7', '8'], + 'Feature1': ['K', 'M', 'O', 'Q', 'S'], + 'Feature2': ['L', 'N', 'P', 'R', 'T']} + +df2 = pd.DataFrame(df2_dict, columns = ['id', 'Feature1', 'Feature2']) +``` + +1. Merge the two DataFrames to get this output: + + | | id | Feature1_x | Feature2_x | Feature1_y | Feature2_y | + |---:|-----:|:-------------|:-------------|:-------------|:-------------| + | 0 | 1 | A | B | K | L | + | 1 | 2 | C | D | M | N | + +2. Merge the two DataFrames to get this output: + + | | id | Feature1_df1 | Feature2_df1 | Feature1_df2 | Feature2_df2 | + |---:|-----:|:---------------|:---------------|:---------------|:---------------| + | 0 | 1 | A | B | K | L | + | 1 | 2 | C | D | M | N | + | 2 | 3 | E | F | nan | nan | + | 3 | 4 | G | H | nan | nan | + | 4 | 5 | I | J | nan | nan | + | 5 | 6 | nan | nan | O | P | + | 6 | 7 | nan | nan | Q | R | + | 7 | 8 | nan | nan | S | T | + diff --git a/one_exercise_per_file/day04/ex03/audit/readme.md b/one_exercise_per_file/day04/ex03/audit/readme.md new file mode 100644 index 0000000..00db420 --- /dev/null +++ b/one_exercise_per_file/day04/ex03/audit/readme.md @@ -0,0 +1,15 @@ +1. This question is validated if the outputted DataFrame's shape is `(1305, 5)` and if `merged.head()` returns a similar table: + +| | Open | Close | Close_Adjusted | Twitter | Reddit | +|:-----------------------------------------------------|-----------:|----------:|-----------------:|------------:|----------:| +| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AAPL') | 0.0991792 | -0.31603 | 0.634787 | -0.00159041 | 1.06053 | +| (Timestamp('2021-01-01 00:00:00', freq='B'), 'FB') | -0.123753 | 1.00269 | 0.713264 | 0.0142127 | -0.487028 | +| (Timestamp('2021-01-01 00:00:00', freq='B'), 'GE') | -1.37775 | -1.01504 | 1.2858 | 0.109835 | 0.04273 | +| (Timestamp('2021-01-01 00:00:00', freq='B'), 'AMZN') | 1.06324 | 0.841241 | -0.799481 | -0.805677 | 0.511769 | +| (Timestamp('2021-01-01 00:00:00', freq='B'), 'DAI') | -0.603453 | -2.06141 | -0.969064 | 1.49817 | 0.730055 | + +One of the answers that returns the correct DataFrame is: + +`market_data.merge(alternative_data, how='left', left_index=True, right_index=True)` + +2. This question is validated if the numbers that are missing in the DataFrame are equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True` diff --git a/one_exercise_per_file/day04/ex03/readme.md b/one_exercise_per_file/day04/ex03/readme.md new file mode 100644 index 0000000..24ed6be --- /dev/null +++ b/one_exercise_per_file/day04/ex03/readme.md @@ -0,0 +1,34 @@ +# Exercise 3 Merge MultiIndex + +The goal of this exercise is to learn to merge DataFrames with MultiIndex. +Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is available during the trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. But, for some reasons the Data Engineer lost the last 15 days of alternative data. + +1. Using `market_data` as the reference, merge `alternative_data` on `market_data` + + ```python + #generate days + all_dates = pd.date_range('2021-01-01', '2021-12-15') + business_dates = pd.bdate_range('2021-01-01', '2021-12-31') + + #generate tickers + tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI'] + + #create indexs + index_alt = pd.MultiIndex.from_product([all_dates, tickers], names=['Date', 'Ticker']) + index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker']) + + # create DFs + market_data = pd.DataFrame(index=index, + data=np.random.randn(len(index), 3), + columns=['Open','Close','Close_Adjusted']) + + alternative_data = pd.DataFrame(index=index_alt, + data=np.random.randn(len(index_alt), 2), + columns=['Twitter','Reddit']) + ``` + +`reset_index` is not allowed for this question + +2. Fill missing values with 0 + +- https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d diff --git a/one_exercise_per_file/day04/ex04/audit/readme.md b/one_exercise_per_file/day04/ex04/audit/readme.md new file mode 100644 index 0000000..37ef49b --- /dev/null +++ b/one_exercise_per_file/day04/ex04/audit/readme.md @@ -0,0 +1,59 @@ +The for loop is forbidden in this exercise. The goal is to use `groupby` and `apply`. + +1. This question is validated if the output is: + + ```python + df = pd.DataFrame(range(1,11), columns=['sequence']) + print(winsorize(df, [0.20, 0.80]).to_markdown()) + ``` + + | | sequence | + |---:|-----------:| + | 0 | 2.8 | + | 1 | 2.8 | + | 2 | 3 | + | 3 | 4 | + | 4 | 5 | + | 5 | 6 | + | 6 | 7 | + | 7 | 8 | + | 8 | 8.2 | + | 9 | 8.2 | + +2. This question is validated if the output is the same as the one returned by: + + ```python + def winsorize(df_series, quantiles): + """ + df: pd.DataFrame or pd.Series + quantiles: list [0.05, 0.95] + + """ + min_value = np.quantile(df_series, quantiles[0]) + max_value = np.quantile(df_series, quantiles[1]) + + return df_series.clip(lower = min_value, upper = max_value) + + + df.groupby("group")[['sequence']].apply(winsorize, [0.05,0.95]) + ``` + + The output can also be a Series instead of a DataFrame. + + The expected output (first rows) is: + + | | sequence | + |---:|-----------:| + | 0 | 1.45 | + | 1 | 2 | + | 2 | 3 | + | 3 | 4 | + | 4 | 5 | + | 5 | 6 | + | 6 | 7 | + | 7 | 8 | + | 8 | 9 | + | 9 | 9.55 | + | 10 | 11.45 | + +- https://towardsdatascience.com/how-to-use-the-split-apply-combine-strategy-in-pandas-groupby-29e0eb44b62e diff --git a/one_exercise_per_file/day04/ex04/readme.md b/one_exercise_per_file/day04/ex04/readme.md new file mode 100644 index 0000000..2c53b0a --- /dev/null +++ b/one_exercise_per_file/day04/ex04/readme.md @@ -0,0 +1,65 @@ +# Exercise 4 Groupby Apply + +The goal of this exercise is to learn to group the data and apply a function on the groups. +The use case we will work on is computing + +1. Create a function that uses `pandas.DataFrame.clip` and that replace extreme values by a given percentile. The values that are greater than the upper percentile 80% are replaced by the percentile 80%. The values that are smaller than the lower percentile 20% are replaced by the percentile 20%. This process that correct outliers is called **winsorizing**. +I recommend to use NumPy to compute the percentiles to make sure we used the same default parameters. + + ```python + def winsorize(df, quantiles): + """ + df: pd.DataFrame + quantiles: list + ex: [0.05, 0.95] + """ + #TODO + return + ``` + + Here is what the function should output: + + ```python + df = pd.DataFrame(range(1,11), columns=['sequence']) + print(winsorize(df, [0.20, 0.80]).to_markdown()) + + ``` + + | | sequence | + |---:|-----------:| + | 0 | 2.8 | + | 1 | 2.8 | + | 2 | 3 | + | 3 | 4 | + | 4 | 5 | + | 5 | 6 | + | 6 | 7 | + | 7 | 8 | + | 8 | 8.2 | + | 9 | 8.2 | + +2. Now we consider that each value belongs to a group. The goal is to apply the **winsorizing to each group**. In this question we use winsorizing values that are common: `[0.05,0.95]` as percentiles. Here is the new data set: + + ```python + groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4]) + + df = pd.DataFrame(data= zip(groups, + range(1,51)), + columns=["group", "sequence"]) + ``` + + The expected output (first rows) is: + + | | sequence | + |---:|-----------:| + | 0 | 1.45 | + | 1 | 2 | + | 2 | 3 | + | 3 | 4 | + | 4 | 5 | + | 5 | 6 | + | 6 | 7 | + | 7 | 8 | + | 8 | 9 | + | 9 | 9.55 | + | 10 | 11.45 | diff --git a/one_exercise_per_file/day04/ex05/audit/readme.md b/one_exercise_per_file/day04/ex05/audit/readme.md new file mode 100644 index 0000000..e79190f --- /dev/null +++ b/one_exercise_per_file/day04/ex05/audit/readme.md @@ -0,0 +1,11 @@ +1. The question is validated if the output is: + +| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') | +|:-------------|-------------------:|-------------------:|--------------------:| +| chair | 22.89 | 32.12 | 27.505 | +| mobile phone | 100 | 111.22 | 105.61 | +| table | 20.45 | 99.99 | 51.22 | + +Note: The columns don't have to be MultiIndex + +My answer is: `df.groupby('product').agg({'value':['min','max','mean']})` \ No newline at end of file diff --git a/one_exercise_per_file/day04/ex05/readme.md b/one_exercise_per_file/day04/ex05/readme.md new file mode 100644 index 0000000..4c9997c --- /dev/null +++ b/one_exercise_per_file/day04/ex05/readme.md @@ -0,0 +1,23 @@ +# Exercise 5 Groupby Agg + +The goal of this exercise is to learn to compute different type of aggregations on the groups. This small DataFrame contains products and prices. + +| | value | product | +|---:|--------:|:-------------| +| 0 | 20.45 | table | +| 1 | 22.89 | chair | +| 2 | 32.12 | chair | +| 3 | 111.22 | mobile phone | +| 4 | 33.22 | table | +| 5 | 100 | mobile phone | +| 6 | 99.99 | table | + +1. Compute the min, max and mean price for each product in one single line of code. The expected output is: + +| product | ('value', 'min') | ('value', 'max') | ('value', 'mean') | +|:-------------|-------------------:|-------------------:|--------------------:| +| chair | 22.89 | 32.12 | 27.505 | +| mobile phone | 100 | 111.22 | 105.61 | +| table | 20.45 | 99.99 | 51.22 | + +Note: The columns don't have to be MultiIndex \ No newline at end of file diff --git a/one_exercise_per_file/day04/ex06/audit/readme.md b/one_exercise_per_file/day04/ex06/audit/readme.md new file mode 100644 index 0000000..e78b541 --- /dev/null +++ b/one_exercise_per_file/day04/ex06/audit/readme.md @@ -0,0 +1,9 @@ +1. This questions is validated is the output is similar to `unstacked_df.head()`: + + | Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') | + |:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:| + | 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 | + | 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 | + | 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 | + +2. The question is validated if the answer is: `unstacked.plot(title = 'Stocks 2021')`. The title can be anything else. diff --git a/one_exercise_per_file/day04/ex06/readme.md b/one_exercise_per_file/day04/ex06/readme.md new file mode 100644 index 0000000..5334cf2 --- /dev/null +++ b/one_exercise_per_file/day04/ex06/readme.md @@ -0,0 +1,32 @@ +# Exercise 6 Unstack + +The goal of this exercise is to learn to unstack a MultiIndex +Let's assume we trained a machine learning model that predicts a daily score on the companies (tickers) below. It may be very useful to unstack the MultiIndex: plot the time series, vectorize the backtest, ... + +```python +business_dates = pd.bdate_range('2021-01-01', '2021-12-31') + +#generate tickers +tickers = ['AAPL', 'FB', 'GE', 'AMZN', 'DAI'] + +#create indexs +index = pd.MultiIndex.from_product([business_dates, tickers], names=['Date', 'Ticker']) + +# create DFs +market_data = pd.DataFrame(index=index, + data=np.random.randn(len(index), 1), + columns=['Prediction']) + +``` + +1. Unstack the DataFrame. + +The first 3 rows of the DataFrame should like this: + +| Date | ('Prediction', 'AAPL') | ('Prediction', 'AMZN') | ('Prediction', 'DAI') | ('Prediction', 'FB') | ('Prediction', 'GE') | +|:--------------------|-------------------------:|-------------------------:|------------------------:|-----------------------:|-----------------------:| +| 2021-01-01 00:00:00 | 0.382312 | -0.072392 | -0.551167 | -0.0585555 | 1.05955 | +| 2021-01-04 00:00:00 | -0.560953 | 0.503199 | -0.79517 | -3.23136 | 1.50271 | +| 2021-01-05 00:00:00 | 0.211489 | 1.84867 | 0.287906 | -1.81119 | 1.20321 | + +2. Plot the 5 times series in the same plot using Pandas built-in visualization functions with a title. \ No newline at end of file diff --git a/one_exercise_per_file/day04/readme.md b/one_exercise_per_file/day04/readme.md new file mode 100644 index 0000000..60ae7d3 --- /dev/null +++ b/one_exercise_per_file/day04/readme.md @@ -0,0 +1,25 @@ +# D04 Piscine AI - Data Science + +Author: + +# Table of Contents: + +Historical part: + +Data wrangling, unify source of data ... + +# Introduction + +... + +## Resources + +Pandas website + +- https://jakevdp.github.io/PythonDataScienceHandbook/ + +- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf + +- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/ + +- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe