@ -31,18 +30,15 @@ The goal of this exercise is to learn to concatenate DataFrames. The logic is th
Here are the two DataFrames to concatenate:
```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]],
                   columns=['letter', 'number'])
```
1. Concatenate these two DataFrames along the index axis and reset the index. The index of the output should be `RangeIndex(start=0, stop=4, step=1)`. **Do not change the index manually**.
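
As a minimal sketch of one possible approach (not necessarily the expected answer), `pd.concat` accepts an `ignore_index` flag that rebuilds the index instead of keeping the original labels. The variable name `concatenated` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 1], ['d', 2]], columns=['letter', 'number'])

# axis=0 stacks the rows; ignore_index=True discards the original labels
# (0, 1, 0, 1) and builds a fresh RangeIndex for the result.
concatenated = pd.concat([df1, df2], axis=0, ignore_index=True)

print(concatenated.index)
```

Chaining `pd.concat([df1, df2]).reset_index(drop=True)` should give the same index; in both cases the index is rebuilt by pandas rather than assigned by hand.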
## Correction
1. This question is validated if the output DataFrame is:
Note: Check that the suffixes are set using the suffix parameters rather than by manually changing the column names.
## Exercise 3 Merge MultiIndex
The goal of this exercise is to learn to merge DataFrames with MultiIndex.
Use the code below to generate the DataFrames. `market_data` contains fake market data. In finance, the market is open only on trading days (business days). `alternative_data` contains fake alternative data from social media. This data is available every day. However, for some reason the Data Engineer lost the last 15 days of alternative data.
1. Using `market_data` as the reference, merge `alternative_data` onto `market_data`.
2. This question is validated if the number of missing values in the DataFrame is equal to 0 and if `filled_df.sum().sum() == merged_df.sum().sum()` gives: `True`
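
The merge and the validation check above can be sketched as follows. The data here is a small hypothetical stand-in for the generated DataFrames: the dates, tickers, column names (`Price`, `Twitter`) and index level names (`Date`, `Ticker`) are all assumptions, and filling the gaps with 0 is only one way to satisfy the check.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the exercise data.
business_dates = pd.bdate_range("2021-01-01", "2021-01-15")  # trading days only
all_dates = pd.date_range("2021-01-01", "2021-01-10")        # every day, but the last days are lost
tickers = ["AAPL", "FB"]

market_index = pd.MultiIndex.from_product([business_dates, tickers], names=["Date", "Ticker"])
alt_index = pd.MultiIndex.from_product([all_dates, tickers], names=["Date", "Ticker"])

market_data = pd.DataFrame({"Price": np.arange(len(market_index), dtype=float)},
                           index=market_index)
alternative_data = pd.DataFrame({"Twitter": np.arange(len(alt_index), dtype=float)},
                                index=alt_index)

# A left merge on both index levels keeps every row of market_data;
# rows with no matching alternative data get NaN.
merged_df = market_data.merge(alternative_data, how="left",
                              left_index=True, right_index=True)

# sum() skips NaN by default, so filling with 0 leaves the total unchanged.
filled_df = merged_df.fillna(0)
print(filled_df.sum().sum() == merged_df.sum().sum())
```

Using `how="left"` is what makes `market_data` the reference: weekend rows of `alternative_data` are dropped, and the lost days show up as missing values.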
# Exercise 4 Groupby Apply
The goal of this exercise is to learn to group the data and apply a function on the groups.
The use case we will work on is computing
1. Create a function that uses `pandas.DataFrame.clip` and that replaces extreme values with a given percentile. Values greater than the upper percentile (80%) are replaced by the 80th percentile. Values smaller than the lower percentile (20%) are replaced by the 20th percentile. This process of correcting outliers is called **winsorizing**.
I recommend using NumPy to compute the percentiles to make sure we use the same default parameters.
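
To illustrate the percentile step on its own, here is a small sketch on assumed sample data (the integers 1 to 10 in a column named `sequence`). By default `np.percentile` interpolates linearly between neighbouring values, which is why the bounds are not necessarily values from the data.

```python
import numpy as np
import pandas as pd

# Assumed sample data: the integers 1..10.
values = pd.DataFrame({"sequence": range(1, 11)})

# np.percentile expects percentages in [0, 100].
lower, upper = np.percentile(values["sequence"], [20, 80])
print(lower, upper)  # approximately 2.8 and 8.2

# clip replaces everything below `lower` or above `upper` by the bound itself.
clipped = values.clip(lower=lower, upper=upper)
print(clipped)
```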
```python
def winsorize(df, quantiles):
    """
    df: pd.DataFrame
@ -201,15 +191,15 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
@ -223,16 +213,16 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
| 8 | 8.2 |
| 9 | 8.2 |
2. Now assume that each value belongs to a group. The goal is to apply **winsorizing to each group**. In this question we use common winsorizing percentiles: `[0.05, 0.95]`. Here is the new data set:
```python
import numpy as np
import pandas as pd

# Five groups labelled 1.0 to 5.0, ten rows each
groups = np.concatenate([np.ones(10), np.ones(10)+1, np.ones(10)+2, np.ones(10)+3, np.ones(10)+4])
df = pd.DataFrame(data=zip(groups, range(1, 51)),
                  columns=["group", "sequence"])
```
The expected output (first rows) is:
| | sequence |
@ -249,19 +239,17 @@ I recommend to use NumPy to compute the percentiles to make sure we used the sam
| 9 | 9.55 |
| 10 | 11.45 |
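
A minimal sketch of how the per-group winsorizing could be written with `groupby` and `apply`. The body of `winsorize` below is an assumption (the exercise only gives its signature), and it treats `quantiles` as fractions in `[0, 1]`. Passing `group_keys=False` is a choice made here so that the result keeps the original row index.

```python
import numpy as np
import pandas as pd


def winsorize(df, quantiles):
    """Clip df to its [lower, upper] quantiles (fractions in [0, 1]).

    The bounds are computed over all values of df, so this sketch is
    meant for a single-column DataFrame.
    """
    lower, upper = np.quantile(df.to_numpy(), quantiles)
    return df.clip(lower=lower, upper=upper)


# Same shape as the exercise data: 5 groups of 10 consecutive integers.
df = pd.DataFrame({"group": np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], 10),
                   "sequence": range(1, 51)})

# Each group is winsorized independently with its own bounds.
result = (df.groupby("group", group_keys=False)[["sequence"]]
            .apply(lambda g: winsorize(g, [0.05, 0.95])))

print(result.head(11))
```

Within the first group (values 1 to 10) the bounds are about 1.45 and 9.55, and the second group starts at about 11.45, which is consistent with the expected output above.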
## Correction
For loops are forbidden in this exercise. The goal is to use `groupby` and `apply`.
1. This question is validated if the output is:
@ -353,12 +339,12 @@ Note: The columns don't have to be MultiIndex
My answer is: `df.groupby('product').agg({'value':['min','max','mean']})`
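
To illustrate the note about the columns, here is a sketch on hypothetical data (the `product` and `value` entries are invented for the example). The dictionary form produces MultiIndex columns, while selecting the column first produces flat ones; the aggregated values are the same.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({"product": ["chair", "chair", "table", "table", "table"],
                   "value": [10.0, 20.0, 100.0, 150.0, 200.0]})

# Dictionary form: columns are a MultiIndex such as ('value', 'min').
multi = df.groupby("product").agg({"value": ["min", "max", "mean"]})

# Selecting the column first: columns are simply 'min', 'max', 'mean'.
flat = df.groupby("product")["value"].agg(["min", "max", "mean"])

print(multi)
print(flat)
```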
# Exercise 6 Unstack
The goal of this exercise is to learn to unstack a MultiIndex.
Let's assume we trained a machine learning model that predicts a daily score for the companies (tickers) below. Unstacking the MultiIndex can be very useful, for example to plot the time series or to vectorize the backtest.
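
As a rough sketch of what unstacking does, the example below builds a tiny hypothetical frame (the dates, tickers, scores, and the names `Date`, `Ticker` and `Prediction` are all assumptions) and moves the ticker level from the index to the columns.

```python
import pandas as pd

# Hypothetical predictions indexed by (Date, Ticker).
dates = pd.bdate_range("2021-01-04", periods=3)
tickers = ["AAPL", "AMZN"]
index = pd.MultiIndex.from_product([dates, tickers], names=["Date", "Ticker"])
scores = pd.DataFrame({"Prediction": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}, index=index)

# unstack("Ticker") pivots that index level into columns:
# one row per date, one column per ticker.
wide = scores["Prediction"].unstack("Ticker")
print(wide)
```

In the wide layout each column is one ticker's time series, so calls such as `wide.plot()` or column-wise arithmetic apply to all tickers at once.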