
Merge pull request #2 from 01-edu/day-02-testing

day2: testing and feedback
brad-gh committed 3 years ago (via GitHub), commit 71ac1f0cd7
1. one_md_per_day_format/piscine/Week1/data/D02/ex3/Ecommerce_purchases.txt (0 changes)
2. one_md_per_day_format/piscine/Week1/data/D02/ex4/iris.csv (0 changes)
3. one_md_per_day_format/piscine/Week1/data/D02/ex4/iris.data (0 changes)
4. one_md_per_day_format/piscine/Week1/day2.md (218 changes)

one_md_per_day_format/piscine/Week1/data/D02/ex4/Ecommerce_purchases.txt → one_md_per_day_format/piscine/Week1/data/D02/ex3/Ecommerce_purchases.txt (0 changes)

one_md_per_day_format/piscine/Week1/data/D02/ex3/iris.csv → one_md_per_day_format/piscine/Week1/data/D02/ex4/iris.csv (0 changes)

sepal_length sepal_width petal_length petal_width flower
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 -3.6 -1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 -4.4 2.9 1400.0 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 0.2 Iris-setosa
11 4.8 3.4 0.2 Iris-setosa
12 4.8 3.0 0.1 Iris-setosa
13 4.3 3.0 0.1 Iris-setosa
14 5.8 4.0 0.2 Iris-setosa
15 5.7 4.4 0.4 Iris-setosa
16 5.4 3.9 0.4 Iris-setosa
17 5.1 3.5 0.3 Iris-setosa
18 5.7 3.8 0.3 Iris-setosa
19 5.1 3.8 0.3 Iris-setosa
20 5.4 3.4 0.2 Iris-setosa
21 5.1 3.7 0.4 Iris-setosa
22 4.6 3.6 0.2 Iris-setosa
23 5.1 3.3 0.5 Iris-setosa
24 4.8 3.4 0.2 Iris-setosa
25 5.0 -3.0 0.2 Iris-setosa
26 5.0 3.4 0.4 Iris-setosa
27 5.2 3.5 0.2 Iris-setosa
28 5.2 3.4 0.2 Iris-setosa
29 4.7 3.2 0.2 Iris-setosa
30 4.8 3.1 1.6 0.2 Iris-setosa
31 5.4 3.4 1.5 0.4 Iris-setosa
32 5.2 4.1 1.5 0.1 Iris-setosa
33 5.5 4.2 1.4 0.2 Iris-setosa
34 4.9 3.1 1.5 0.1 Iris-setosa
35 5.0 3.2 1.2 0.2 Iris-setosa
36 5.5 3.5 1.3 0.2 Iris-setosa
37 4.9 1.5 0.1 Iris-setosa
38 4.4 3.0 1.3 0.2 Iris-setosa
39 5.1 3.4 1.5 0.2 Iris-setosa
40 5.0 3.5 1.3 0.3 Iris-setosa
41 4.5 2.3 1.3 0.3 Iris-setosa
42 4.4 3.2 1.3 0.2 Iris-setosa
43 5.0 3.5 1.6 0.6 Iris-setosa
44 5.1 3.8 1.9 0.4 Iris-setosa
45 4.8 3.0 1.4 0.3 Iris-setosa
46 5.1 3809.0 1.6 0.2 Iris-setosa
47 4.6 3.2 1.4 0.2 Iris-setosa
48 5.3 3.7 1.5 0.2 Iris-setosa
49 5.0 3.3 1.4 0.2 Iris-setosa
50 7.0 3.2 4.7 1.4 Iris-versicolor
51 6.4 3200.0 4.5 1.5 Iris-versicolor
52 6.9 3.1 4.9 1.5 Iris-versicolor
53 5.5 2.3 4.0 1.3 Iris-versicolor
54 6.5 2.8 4.6 1.5 Iris-versicolor
55 5.7 2.8 4.5 1.3 Iris-versicolor
56 6.3 3.3 4.7 1600.0 Iris-versicolor
57 4.9 2.4 3.3 1.0 Iris-versicolor
58 6.6 2.9 4.6 1.3 Iris-versicolor
59 5.2 2.7 3.9 Iris-versicolor
60 5.0 2.0 3.5 1.0 Iris-versicolor
61 5.9 3.0 4.2 1.5 Iris-versicolor
62 6.0 2.2 4.0 1.0 Iris-versicolor
63 6.1 2.9 4.7 1.4 Iris-versicolor
64 5.6 2.9 3.6 1.3 Iris-versicolor
65 6.7 3.1 4.4 1.4 Iris-versicolor
66 5.6 3.0 4.5 1.5 Iris-versicolor
67 5.8 2.7 4.1 1.0 Iris-versicolor
68 6.2 2.2 4.5 1.5 Iris-versicolor
69 5.6 2.5 3.9 1.1 Iris-versicolor
70 5.9 3.2 4.8 1.8 Iris-versicolor
71 6.1 2.8 4.0 1.3 Iris-versicolor
72 6.3 2.5 4.9 1.5 Iris-versicolor
73 6.1 2.8 4.7 1.2 Iris-versicolor
74 6.4 2.9 4.3 1.3 Iris-versicolor
75 6.6 3.0 4.4 1.4 Iris-versicolor
76 6.8 2.8 4.8 1.4 Iris-versicolor
77 6.7 3.0 5.0 1.7 Iris-versicolor
78 6.0 2.9 4.5 1.5 Iris-versicolor
79 5.7 2.6 3.5 1.0 Iris-versicolor
80 5.5 2.4 3.8 1.1 Iris-versicolor
81 5.5 2.4 3.7 1.0 Iris-versicolor
82 5.8 2.7 3.9 1.2 Iris-versicolor
83 6.0 2.7 5.1 1.6 Iris-versicolor
84 5.4 3.0 4.5 1.5 Iris-versicolor
85 6.0 3.4 4.5 1.6 Iris-versicolor
86 6.7 3.1 4.7 1.5 Iris-versicolor
87 6.3 2.3 4.4 1.3 Iris-versicolor
88 5.6 3.0 4.1 1.3 Iris-versicolor
89 5.5 2.5 4.0 1.3 Iris-versicolor
90 5.5 2.6 4.4 1.2 Iris-versicolor
91 6.1 3.0 4.6 1.4 Iris-versicolor
92 5.8 2.6 4.0 1.2 Iris-versicolor
93 5.0 2.3 3.3 1.0 Iris-versicolor
94 5.6 2.7 4.2 1.3 Iris-versicolor
95 5.7 3.0 4.2 1.2 Iris-versicolor
96 5.7 2.9 4.2 1.3 Iris-versicolor
97 6.2 2.9 4.3 1.3 Iris-versicolor
98 5.1 2.5 3.0 1.1 Iris-versicolor
99 5.7 2.8 1.3 Iris-versicolor
100 3.3 2.5 Iris-virginica
101 5.8 2.7 1.9 Iris-virginica
102 7.1 3.0 2.1 Iris-virginica
103 6.3 2.9 1.8 Iris-virginica
104 6.5 3.0 2.2 Iris-virginica
105 7.6 3.0 6.6 2.1 Iris-virginica
106 4.9 2.5 4.5 1.7 Iris-virginica
107 7.3 2.9 6.3 1.8 Iris-virginica
108 6.7 2.5 5.8 1.8 Iris-virginica
109 7.2 3.6 6.1 2.5 Iris-virginica
110 6.5 3.2 5.1 2.0 Iris-virginica
111 6.4 2.7 5.3 1.9 Iris-virginica
112 6.8 3.0 5.5 2.1 Iris-virginica
113 5.7 2.5 5.0 2.0 Iris-virginica
114 5.8 5.1 2.4 Iris-virginica
115 6.4 5.3 2.3 Iris-virginica
116 6.5 5.5 1.8 Iris-virginica
117 7.7 6.7 2.2 Iris-virginica
118 7.7 2.3 Iris-virginica
119 6.0 5.0 1.5 Iris-virginica
120 6.9 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 always check the data !!!!!!!!
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 -4.8 1.8 Iris-virginica
127 3.0 4.9 1.8 Iris-virginica
128 6.4 2.8 5.6 2.1 Iris-virginica
129 7.2 3.0 5.8 1.6 Iris-virginica
130 7.4 2.8 6.1 1.9 Iris-virginica
131 7.9 3.8 6.4 2.0 Iris-virginica
132 6.-4 2.8 5.6 2.2 Iris-virginica
133 6.3 2.8 1.5 Iris-virginica
134 6.1 2.6 5.6 1.4 Iris-virginica
135 7.7 3.0 6.1 2.3 Iris-virginica
136 6.3 3.4 5.6 2.4 Iris-virginica
137 6.4 3.1 5.5 1.8 Iris-virginica
138 6.0 3.0 4.8 1.8 Iris-virginica
139 6900 3.1 5.4 2.1 Iris-virginica
140 6.7 3.1 2.4 Iris-virginica
141 6.9 3.1 5.1 2.3 Iris-virginica
142 580 2.7 5.1 Iris-virginica
143 6.8 3.2 5.9 2.3 Iris-virginica
144 6.7 3.3 5.7 -2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

one_md_per_day_format/piscine/Week1/data/D02/ex3/iris.data → one_md_per_day_format/piscine/Week1/data/D02/ex4/iris.data (0 changes)

one_md_per_day_format/piscine/Week1/day2.md (218 changes)

@@ -1,51 +1,52 @@
# D02 Piscine AI - Data Science
Author:
# Table of Contents:
Historical part:
# Introduction
The goal of this day is to understand practical usage of Pandas.
As Pandas is intensively used in Data Science, other days of the piscine will be dedicated to it.
Not only is the Pandas library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection.
Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even if it is 40 pages long.
The version of Pandas I used is '1.0.1'.
## Rules
...
## Resources
- If I had to give you one resource it would be this one:
https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf
It contains ALL you need to know about Pandas.
- Pandas documentation: https://pandas.pydata.org/docs/
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
# Exercise 1
The goal of this exercise is to learn to create basic Pandas objects.
1. Create a DataFrame as below using two ways:
- From a NumPy array
- From a Pandas Series
@@ -57,45 +58,39 @@ The goal of this exercise is to learn to create basic Pandas objects.
| 7 | Grey | [7, 8] | 4.4 |
| 9 | Black | [9, 10] | 5.5 |
2. Print the type of every column and the type of the first value of every column
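For reference, here is a minimal sketch of the two ways. The values and column names are hypothetical: only the Grey and Black rows of the model DataFrame are visible in this diff, the rest is invented for illustration.
```python
import numpy as np
import pandas as pd

# Hypothetical data: only the last two rows (Grey, Black) are visible above
data = np.array([['Blue', [1, 2], 1.1],
                 ['Red', [3, 4], 2.2],
                 ['Green', [5, 6], 3.3],
                 ['Grey', [7, 8], 4.4],
                 ['Black', [9, 10], 5.5]], dtype=object)
index = [1, 3, 5, 7, 9]                  # deliberately not 1,2,3,4,5
columns = ['color', 'list', 'number']    # hypothetical column names

# Way 1: from a NumPy array (dtype=object keeps the lists as lists)
df = pd.DataFrame(data, index=index, columns=columns)

# Way 2: from Pandas Series, one per column
df = pd.DataFrame({col: pd.Series(data[:, i], index=index)
                   for i, col in enumerate(columns)})

# Question 2: type of each column and of its first value
for col in df.columns:
    print(type(df[col]))
    print(type(df[col].iloc[0]))
```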
## Solution
1. The solution is accepted if the DataFrame created is the same as the "model" DataFrame. Check that the index is not 1,2,3,4,5.
2. The solution is accepted if the types you get for the columns are
```console
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
```
and if the types of the first value of the columns are
```console
<class 'str'>
<class 'list'>
<class 'float'>
```
# Exercise 2 **Electric power consumption**
The goal of this exercise is to learn to manipulate real data with Pandas.
The data set used is **Individual household electric power consumption**
1. Delete the columns `Time`, `Sub_metering_2` and `Sub_metering_3`
2. Set `Date` as index
3. Create a function that takes as input the DataFrame with the data set and returns a DataFrame with updated types:
```python
def update_types(df):
    #TODO
    return df
```
@@ -103,39 +98,31 @@ The data set used is **Individual household electric power consumption**
4. Use `describe` to have an overview of the data set
5. Delete the rows with missing values
6. Modify `Sub_metering_1` by adding 1 to it and multiplying the total by 0.06. If x is a row the output is: (x+1)*0.06
7. Select all the rows for which the Date is greater than or equal to 2008-12-27 and `Voltage` is greater than or equal to 242
8. Print the 88888th row.
9. What is the date for which the `Global_active_power` is maximal?
10. Sort the first three columns by descending order of `Global_active_power` and ascending order of `Voltage`.
11. Compute the daily average of `Global_active_power`.
## Correction
1. `del` works but it is not a solution I recommend. For this exercise it is accepted. It is expected to use `drop` with `axis=1`. `inplace=True` may be useful to avoid assigning the result to a variable.
2. The preferred solution is `set_index` with `inplace=True`. As long as the DataFrame returns the output below, the solution is accepted. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted.
```python
Input: df.head().index
Output:
DatetimeIndex(['2006-12-16', '2006-12-16','2006-12-16', '2006-12-16','2006-12-16'],
dtype='datetime64[ns]', name='Date', freq=None)
```
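A possible way to obtain that index, as a sketch assuming the raw `Date` column is read as strings:
```python
# parse the Date column to datetime64 before setting it as index
df.loc[:, 'Date'] = pd.to_datetime(df.loc[:, 'Date'])
df.set_index('Date', inplace=True)
```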
3. The preferred solution is `pd.to_numeric` with `errors='coerce'`. The solution is accepted if all types are `float64`.
```python
Input: df.dtypes
Output:
```
@@ -151,17 +138,16 @@ The data set used is **Individual household electric power consumption**
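For illustration, the `update_types` function from question 3 could look like this minimal sketch, assuming every remaining column is meant to be numeric:
```python
def update_types(df):
    # coerce each column to a numeric dtype; invalid entries become NaN
    for col in df.columns:
        df.loc[:, col] = pd.to_numeric(df.loc[:, col], errors='coerce')
    return df
```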
4. `df.describe()` is expected
5. You should have noticed that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` shows the number of missing values per column and `df.dropna()` with `inplace=True` removes the rows with missing values. The solution is accepted if you used `dropna` and the number of missing values is 0.
6. Two solutions are accepted:
- `df.loc[:,'A'] = (df['A'] + 1) * 0.06`
- Using `apply`: `df.loc[:,'A'] = df.loc[:,'A'].apply(lambda x: (x+1)*0.06)`
You may wonder why `df.loc[:,'A']` is required and if `df['A'] = ...` works too. The answer is no. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you are assigning a value to a **copy** of the DataFrame and not to the DataFrame itself.
More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
7. The solution is accepted as long as the output of `print(filtered_df.head().to_markdown())` is:
| Date | Global_active_power | Global_reactive_power |
|:--------------------|----------------------:|------------------------:|
@@ -173,9 +159,9 @@ The data set used is **Individual household electric power consumption**
Check that the number of rows is equal to **449667**.
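For reference, a sketch of a filter that produces this kind of output, assuming `Date` is the datetime index:
```python
# keep rows from 2008-12-27 onwards with Voltage >= 242
filtered_df = df[(df.index >= '2008-12-27') & (df['Voltage'] >= 242)]
print(filtered_df.head().to_markdown())
```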
8. The solution is accepted if the output is
```console
Global_active_power 0.254
Global_reactive_power 0.000
Voltage 238.350
```
@@ -187,7 +173,7 @@ The data set used is **Individual household electric power consumption**
9. The solution is accepted if the output is `Timestamp('2009-02-22 00:00:00')`
10. The solution is accepted if the output for `print(sorted_df.tail().to_markdown())` is
| Date | Global_active_power | Global_reactive_power | Voltage |
|:--------------------|----------------------:|------------------------:|----------:|
@@ -197,10 +183,9 @@ The data set used is **Individual household electric power consumption**
| 2008-08-28 00:00:00 | 0.076 | 0 | 235.64 |
| 2008-12-08 00:00:00 | 0.076 | 0 | 236.5 |
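One possible way to get there, as a sketch assuming the first three columns are `Global_active_power`, `Global_reactive_power` and `Voltage`:
```python
# sort by Global_active_power descending, then Voltage ascending
sorted_df = df[['Global_active_power', 'Global_reactive_power', 'Voltage']].sort_values(
    by=['Global_active_power', 'Voltage'], ascending=[False, True])
print(sorted_df.tail().to_markdown())
```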
11. The solution is based on `groupby`, which creates groups based on the index `Date` and aggregates the groups using the `mean`. The solution is accepted if the output is
```console
Date
2006-12-16 3.053475
2006-12-17 2.354486
@@ -216,16 +201,14 @@ The data set used is **Individual household electric power consumption**
Name: Global_active_power, Length: 1433, dtype: float64
```
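A sketch of this aggregation, assuming the DataFrame is indexed by `Date`:
```python
# group the rows sharing the same Date and average Global_active_power
daily_avg = df.groupby(df.index)['Global_active_power'].mean()
```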
# Exercise 3: E-commerce purchases
The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided since exercise 2 should have given you a nice introduction.
The data set used is **E-commerce purchases**.
Questions:
1. How many rows and columns are there?
2. What is the average Purchase Price?
3. What were the highest and lowest purchase prices?
@@ -240,14 +223,15 @@ Questions:
12. What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)
## Correction
To validate this exercise all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas.
1. How many rows and columns are there? **10000 entries** and **14 columns**
There are many solutions based on `shape`, `info` or `describe`.
2. What is the average Purchase Price? **50.34730200000025**
Even if `np.mean` gives the solution, `df['Purchase Price'].mean()` is preferred.
3. What were the highest and lowest purchase prices?
@@ -280,18 +264,18 @@ The validate this exercise all answers should return the expected numerical valu
Designer, jewellery 27
There are many ways to get the solution but the goal of this question was to make you use `value_counts`
8. Someone made a purchase that came from Lot: `"90 WT"`, what was the Purchase Price for this transaction? **75.1**
9. What is the email of the person with the following Credit Card Number: `4926535242672853`? **bondellen@williams-garza.com**
10. How many people have American Express as their Credit Card Provider and made a purchase above `$95`? **39**
The preferred solution is based on this:
`df[(df['A'] == X) & (df['B'] > Y)]`
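Applied to this question it could look like the sketch below; the column names `CC Provider` and `Purchase Price` are assumed, not confirmed by this diff:
```python
# boolean masks combined with &, then count the matching rows
count = df[(df['CC Provider'] == 'American Express')
           & (df['Purchase Price'] > 95)].shape[0]
print(count)  # expected: 39
```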
11. How many people have a credit card that expires in `2025`? **1033**
The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the expiration date.
12. What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)
@@ -301,48 +285,65 @@ The validate this exercise all answers should return the expected numerical valu
- smith.com 42
- williams.com 37
The preferred solution is based on the usage of `apply` on a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
# Exercise 4: Handling missing values
The goal of this exercise is to learn to handle missing values. In the previous exercise we used the first technique: filtering out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small.
This article explains the different types of missing data and how they should be handled: https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
"**It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.**"
- Preliminary: Drop the `flower` column
1. Fill the missing values with a different "strategy" for each column:
`sepal_length` -> `mean`
`sepal_width` -> `median`
`petal_length`, `petal_width` -> `0`
2. Fill the missing values with the median of the associated column using `fillna`.
- Bonus questions:
- Filling the missing values by 0 or the mean of the associated column is common in Data Science. In that case, explain why filling the missing values with 0 or the mean is a bad idea.
- Find a special row ;-)
## Correction
1. This question is validated if you have done these two steps in that order:
- Convert the numerical columns to `float`
example:
```python
pd.to_numeric(df.loc[:,col], errors='coerce')
```
- Fill the missing values. There are many solutions for this step, here is one of them.
example:
```python
df.fillna({'sepal_length': df.sepal_length.mean(),
           'sepal_width': df.sepal_width.median(),
           'petal_length': 0,
           'petal_width': 0})
```
2. This question is validated if the solution is: `df.loc[:,col].fillna(df[col].median())`
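For example, as a sketch looping over the numerical columns:
```python
# fill each column's missing values with that column's median
for col in df.columns:
    df.loc[:, col] = df.loc[:, col].fillna(df[col].median())
```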
**Bonus questions**:
- It is important to understand why filling the missing values with 0 or the mean of the column is a bad idea.
| | sepal_length | sepal_width | petal_length | petal_width |
|:------|---------------:|--------------:|---------------:|--------------:|
@@ -355,19 +356,10 @@ To validate the exercise, you should have done these two steps in that order:
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 6900 | 3809 | 1400 | 1600 |
Once we filled the missing values as suggested in the first question, `df.describe()` returns this interesting summary. We notice that the mean is way higher than the median. It means that there are maybe some outliers in the data. The quantile 75 and the max confirm that: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet you realise this small flower can't be that big. The outliers have a major impact on the mean, which equals 56.9. Filling the missing values with this number is not correct since it doesn't correspond to the real size of this flower. That is why in that case the best strategy to fill the missing values is the median. The truth is that I modified the data set! But real data sets ALWAYS contain outliers.
Always think about the meaning of the data transformation! If you fill the missing values with zero, it means that you consider that the length or width of some flowers may be 0. It doesn't make sense.
- If you noticed that there are some negative values and some huge values, you will be a good data scientist. **YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA**. Print the row with index 122 ;-)
This week, we will have the opportunity to focus on the data pre-processing to understand how the outliers are handled.